Abstract

A popular approach within the signal processing and machine learning communities consists in modelling
signals as sparse linear combinations of atoms selected from a learned dictionary. While this
paradigm has led to numerous empirical successes in various fields ranging from image to audio processing,
there have only been a few theoretical arguments supporting this evidence. In particular, sparse
coding, or sparse dictionary learning, relies on a non-convex procedure whose local minima have not been
fully analyzed yet. In this paper, we consider a probabilistic model of sparse signals, and show that, 
with high probability, sparse coding admits a local minimum around the reference dictionary generating 
the signals. Our study takes into account the case of over-complete dictionaries and noisy signals, thus 
extending previous work limited to noiseless settings and/or under-complete dictionaries. The analysis 
we conduct is non-asymptotic and makes it possible to understand how the key quantities of the problem, 
such as the coherence or the level of noise, can scale with respect to the dimension of the signals, the 
number of atoms, the sparsity and the number of observations. 
1 Introduction

Modelling signals as sparse linear combinations of atoms selected from a dictionary has become a popular
paradigm in many fields, including signal processing, statistics, and machine learning. This line of research
has witnessed the development of several well-founded theoretical frameworks (see, e.g., Wainwright [2009],
Zhang [2009]) and efficient algorithmic tools (see, e.g., Bach et al. [2011] and references therein).

However, the performance of such approaches hinges on the representation of the signals, which makes the
question of designing "good" dictionaries prominent. A great deal of effort has been dedicated to coming up
with efficient predefined dictionaries, e.g., the various types of wavelets [Mallat, 2008]. These representations
have notably contributed to many successful image processing applications such as compression, denoising
and deblurring. More recently, the idea of simultaneously learning the dictionary and the sparse decompositions
of the signals, also known as sparse dictionary learning, or simply, sparse coding, has emerged
as a powerful framework, with state-of-the-art performance in many tasks, including inpainting and image
classification (see, e.g., Mairal et al. [2010] and references therein).

Although sparse dictionary learning can sometimes be formulated as convex [Bach et al., 2008, Bradley and Bagnell,
2009], non-parametric Bayesian [Zhou et al., 2009] and submodular [Krause and Cevher, 2010] problems,
the most popular and widely used definition of sparse coding brings into play a non-convex optimization
problem. Despite its empirical and practical success, there has only been little theoretical analysis
of the properties of sparse dictionary learning. For instance, Maurer and Pontil [2010], Vainsencher et al.
[2010], Mehta and Gray [2012] derive generalization bounds which quantify how much the expected signal-reconstruction
error differs from the empirical one, computed from a random and finite-size sample of signals.
In particular, the bounds obtained by Maurer and Pontil [2010], Vainsencher et al. [2010] are non-asymptotic
and uniform with respect to the whole class of dictionaries considered (e.g., those with normalized atoms).
As discussed later, the questions raised in this paper explore a different and complementary direction.

Another theoretical aspect of interest consists in characterizing the local minima of the optimization
problem associated with sparse coding, in spite of the non-convexity of its formulation. This problem is closely
related to the question of identifiability, that is, whether it is possible to recover a reference dictionary
that is assumed to generate the observed signals. Identifying such a dictionary is important when the
interpretation of the learned atoms matters, e.g., in source localization [Comon and Jutten, 2010] or in topic
modelling [Jenatton et al., 2011]. Gribonval and Schnass [2010] pioneered research in this
direction by considering noiseless sparse signals, possibly corrupted by some outliers, in the case where
the reference dictionary forms a basis. Still in a noiseless setting, and without outliers, Geng et al. [2011]
extended the analysis to over-complete dictionaries, i.e., those composed of more atoms than the dimension
of the signals. To the best of our knowledge, comparable analyses have not been carried out yet for noisy
signals. In particular, the structure of the proofs of Gribonval and Schnass [2010], Geng et al. [2011] hinges
on the absence of noise and cannot be straightforwardly transposed to take into account some noise; this
point will be discussed subsequently.

In this paper, we therefore analyze the local minima of sparse coding in the presence of noise and make 
the following contributions: 

- Within a probabilistic model of sparse signals, we derive a non-asymptotic lower bound of the probability
of finding a local minimum in a neighborhood of the reference dictionary. 

- Our work makes it possible to better understand (a) how small the neighborhood around the reference 
dictionary can be, (b) how many signals are required to hope for identifiability, (c) what the impact of 
the degree of over-completeness is, and (d) what level of noise appears as manageable. 

- We show that under deterministic coherence-based assumptions, such a local minimum is guaranteed to 
exist with high probability. 



2 Problem statement 

We introduce in this section the material required to define our problem and state our results. 

Notation. For any integer $p$, we define the set $[\![1;p]\!] = \{1, \dots, p\}$. For all vectors $v \in \mathbb{R}^p$, we denote by
$\mathrm{sign}(v) \in \{-1,0,1\}^p$ the vector such that its $j$-th entry $[\mathrm{sign}(v)]_j$ is equal to zero if $v_j = 0$, and to one
(respectively, minus one) if $v_j > 0$ (respectively, $v_j < 0$). We extensively manipulate matrix norms in the
sequel. For any matrix $A \in \mathbb{R}^{m \times p}$, we define its Frobenius norm by $\|A\|_F = [\sum_{i=1}^m \sum_{j=1}^p A_{ij}^2]^{1/2}$; similarly,
we denote the spectral norm of $A$ by $|||A|||_2 = \max_{\|x\|_2 \le 1} \|Ax\|_2$, and refer to the operator $\ell_\infty$-norm as
$|||A|||_\infty = \max_{\|x\|_\infty \le 1} \|Ax\|_\infty = \max_{i \in [\![1;m]\!]} \sum_{j=1}^p |A_{ij}|$.

For any square matrix $B \in \mathbb{R}^{n \times n}$, we denote by $\mathrm{diag}(B) \in \mathbb{R}^n$ the vector formed by extracting the
diagonal terms of $B$, and conversely, for any $b \in \mathbb{R}^n$, we use $\mathrm{Diag}(b) \in \mathbb{R}^{n \times n}$ to represent the (square)
diagonal matrix whose diagonal elements are built from the vector $b$. For any $m \times p$ matrix $A$ and index set
$J \subseteq [\![1;p]\!]$ we denote by $A_J$ the matrix obtained by concatenating the columns of $A$ indexed by $J$. Finally,
the sphere in $\mathbb{R}^p$ is denoted $\mathcal{S}^p = \{v \in \mathbb{R}^p;\ \|v\|_2 = 1\}$ and $\mathcal{S}^p_+ = \mathcal{S}^p \cap \mathbb{R}^p_+$.

2.1 Background material on sparse coding 

Let us consider a set of $n$ signals $X = [x^1, \dots, x^n] \in \mathbb{R}^{m \times n}$ of dimension $m$, along with a dictionary $D =
[d^1, \dots, d^p] \in \mathbb{R}^{m \times p}$ formed of $p$ atoms, also known as dictionary elements. Sparse coding simultaneously
learns $D$ and a set of $n$ sparse $p$-dimensional vectors $A = [\alpha^1, \dots, \alpha^n] \in \mathbb{R}^{p \times n}$, such that each signal $x^i$ can
be well approximated by $x^i \approx D\alpha^i$ for $i$ in $[\![1;n]\!]$. By sparse, we mean that the vector $\alpha^i$ has $k \ll p$ non-zero
coefficients, so that we aim at reconstructing $x^i$ from only a few atoms. Before introducing the sparse coding
formulation [Mairal et al., 2010, Olshausen and Field, 1997], we need some definitions:



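Definition 1. For any dictionary $D \in \mathbb{R}^{m \times p}$ and signal $x \in \mathbb{R}^m$, we define

$$\mathcal{L}_x(D, \alpha) \triangleq \frac{1}{2}\|x - D\alpha\|_2^2 + \lambda\|\alpha\|_1,$$

$$f_x(D) \triangleq \min_{\alpha \in \mathbb{R}^p} \mathcal{L}_x(D, \alpha). \qquad (1)$$

Similarly for any set of $n$ signals $X = [x^1, \dots, x^n] \in \mathbb{R}^{m \times n}$, we introduce

$$F_n(D) \triangleq \frac{1}{n}\sum_{i=1}^n f_{x^i}(D).$$
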

Based on problem (1), referred to as Lasso in statistics [Tibshirani, 1996], and basis pursuit in signal
processing [Chen et al., 1998], the standard approach to perform sparse coding [Olshausen and Field, 1997,
Mairal et al., 2010] solves the minimization problem

$$\min_{D \in \mathcal{D}} F_n(D), \qquad (2)$$

where the regularization parameter $\lambda$ in (1) controls the level of sparsity, while $\mathcal{D} \subseteq \mathbb{R}^{m \times p}$ is a compact set;
in this paper, $\mathcal{D}$ denotes the set of dictionaries with unit $\ell_2$-norm atoms, which is a natural choice in image
processing [Mairal et al., 2010, Gribonval and Schnass, 2010]. Note however that other choices for the set
$\mathcal{D}$ may also be relevant depending on the application at hand (see, e.g., Jenatton et al. [2011] where in the
context of topic models, the atoms in $\mathcal{D}$ belong to the unit simplex).
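
To make Definition 1 concrete, the following minimal Python sketch evaluates the Lasso objective of problem (1) and approximates $f_x(D)$ and $F_n(D)$ with a plain proximal-gradient (ISTA) loop. ISTA is chosen here only for illustration and is not the solver used in the paper; the function names, the list-of-rows storage of $D$ and the step-size rule are assumptions of this sketch.

```python
import math

def lasso_objective(x, D, alpha, lam):
    """L_x(D, alpha) = 0.5 * ||x - D alpha||_2^2 + lam * ||alpha||_1 (D stored as a list of rows)."""
    residual = [xi - sum(d * a for d, a in zip(row, alpha)) for xi, row in zip(x, D)]
    return 0.5 * sum(r * r for r in residual) + lam * sum(abs(a) for a in alpha)

def soft_threshold(z, tau):
    """Proximal operator of tau * |.|."""
    return math.copysign(max(abs(z) - tau, 0.0), z)

def f_x(x, D, lam, n_iter=500):
    """Approximate f_x(D) = min_alpha L_x(D, alpha) by ISTA; returns (value, alpha)."""
    m, p = len(D), len(D[0])
    # Step 1/L, where L = ||D||_F^2 upper-bounds the squared spectral norm of D.
    L = sum(d * d for row in D for d in row)
    alpha = [0.0] * p
    for _ in range(n_iter):
        residual = [x[i] - sum(D[i][j] * alpha[j] for j in range(p)) for i in range(m)]
        grad = [-sum(D[i][j] * residual[i] for i in range(m)) for j in range(p)]
        alpha = [soft_threshold(alpha[j] - grad[j] / L, lam / L) for j in range(p)]
    return lasso_objective(x, D, alpha, lam), alpha

def F_n(X, D, lam):
    """Empirical cost F_n(D): average of f_x(D) over the signals in X (a list of vectors)."""
    return sum(f_x(x, D, lam)[0] for x in X) / len(X)

# Toy check with an orthogonal dictionary, where the minimizer is the soft-thresholded signal.
D = [[1.0, 0.0], [0.0, 1.0]]
value, alpha = f_x([3.0, 0.5], D, lam=1.0)
```

In this toy case the minimizer is obtained by soft-thresholding the signal coordinate-wise, which gives an easy sanity check; for a general dictionary the loop only returns an approximation whose accuracy depends on `n_iter`.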



2.2 Main objectives 

The goal of the paper is to characterize some local minima of the function $F_n$ under a generative model for
the signals $x^i$. Throughout the paper, we assume the observed signals are generated independently according
to a specified probabilistic model. The considered signals are typically drawn as $x^i = D_0\alpha_0^i + \varepsilon^i$ where $D_0$
is a fixed reference dictionary, $\alpha_0^i$ is a sparse coefficient vector, and $\varepsilon^i$ is a noise term. The specifics of the
underlying probabilistic model are given in Sec. 2.6. Under this model, we can state more precisely our
objective: we want to show that

$$\Pr(F_n \text{ has a local minimum in a "neighborhood" of } D_0) \approx 1.$$

We loosely refer to a certain "neighborhood" since in our regularized formulation, a local minimum cannot
appear exactly at $D_0$. The proper meaning of this neighborhood is the subject of Sec. 2.3.



Intrinsic ambiguities of sparse coding. Importantly, we have so far referred to $D_0$ as the reference
dictionary generating the signals. However, and as already discussed in Gribonval and Schnass [2010],
Geng et al. [2011] and more generally the related literature on blind source separation and independent
component analysis [Comon and Jutten, 2010], it is known that the objective of (2) is invariant by sign flips
and atom permutations. As a result, while solving (2), we cannot hope to identify the specific $D_0$. We focus
instead on the local identifiability of the whole equivalence class defined by the transformations described
above. From now on, we simply refer to $D_0$ to denote one element of this equivalence class. Also, since
these transformations are discrete, our local analysis is not affected by invariance issues, as soon as we are
sufficiently close to some representative of $D_0$.






2.3 Local minima on the oblique manifold 



The minimization of $F_n$ is carried out over $\mathcal{D}$, which is the set of dictionaries with unit $\ell_2$-norm atoms. This
set turns out to be a manifold, known as the oblique manifold [Absil et al., 2008]. Since $D_0$ is assumed to
belong to $\mathcal{D}$, it is therefore natural to consider the behavior of $F_n$ according to the geometry and topology
of $\mathcal{D}$. To this end, we consider a specific (local) parametrization of $\mathcal{D}$.

Parametrization of the oblique manifold. Specifically, let us consider the set of matrices

$$\mathcal{W}_{D_0} = \{W \in \mathbb{R}^{m \times p};\ \mathrm{diag}(W^\top D_0) = 0 \text{ and } \mathrm{diag}(W^\top W) = 1\}.$$

In words, a matrix $W \in \mathcal{W}_{D_0}$ has unit norm columns $\|w^j\|_2 = 1$ that are orthogonal to the corresponding
columns of $D_0$: $[w^j]^\top d_0^j = 0$, for any $j \in [\![1;p]\!]$. Now, for any matrix $W \in \mathcal{W}_{D_0}$, for any unit norm velocity
vector $v \in \mathcal{S}^p_+$, and for all $t \in \mathbb{R}$, we introduce the parameterized dictionary:

$$D(D_0, W, v, t) = D_0\,\mathrm{Diag}[\cos(vt)] + W\,\mathrm{Diag}[\sin(vt)], \qquad (3)$$

where $\mathrm{Diag}[\cos(vt)]$ and $\mathrm{Diag}[\sin(vt)] \in \mathbb{R}^{p \times p}$ stand for the diagonal matrices with diagonal terms equal
to $\{\cos(v_j t)\}_{j \in [\![1;p]\!]}$ and $\{\sin(v_j t)\}_{j \in [\![1;p]\!]}$ respectively. By construction, we have $D(D_0, W, v, t) \in \mathcal{D}$ for all
$t \in \mathbb{R}$ and $D(D_0, W, v, 0) = D_0$. To ease notation, we will denote $D(W, v, t)$, leaving the dependence on
the reference dictionary $D_0$ implicit. Also, when it is clear from the context, we will drop the
dependence on $W, v$ in $D$. Note that the set of matrices given by $W\mathrm{Diag}(v)$ corresponds to the tangent
space of $\mathcal{D}$ at $D_0$, intersected with the set of matrices in $\mathbb{R}^{m \times p}$ with unit Frobenius norm (since we have
$\|W\mathrm{Diag}(v)\|_F = 1$).
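
The parametrization (3) can be checked numerically. The sketch below, written in plain Python under illustrative assumptions (the helper names and the toy values of $D_0$, $W$, $v$ and $t$ are ours, not the paper's), builds $D(W, v, t)$ and lets one verify that its atoms keep unit norm, that $t = 0$ gives back $D_0$, and that for this example the Frobenius distance to $D_0$ stays below $t$.

```python
import math

def curve(D0, W, v, t):
    """D(W, v, t) = D0 Diag[cos(v t)] + W Diag[sin(v t)], with matrices stored as lists of rows."""
    return [[d0 * math.cos(vj * t) + w * math.sin(vj * t)
             for d0, w, vj in zip(row0, row_w, v)]
            for row0, row_w in zip(D0, W)]

def column_norms(D):
    """Euclidean norm of every column (atom) of D."""
    return [math.sqrt(sum(row[j] ** 2 for row in D)) for j in range(len(D[0]))]

def frobenius_distance(A, B):
    """Frobenius norm of A - B."""
    return math.sqrt(sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb)))

# Toy example (assumed values): D0 is the 2x2 identity, and each column of W is a unit vector
# orthogonal to the matching column of D0, so that W belongs to the set W_{D0}.
D0 = [[1.0, 0.0], [0.0, 1.0]]
W = [[0.0, -1.0], [1.0, 0.0]]
v = [0.6, 0.8]   # non-negative entries, unit Euclidean norm
t = 0.3
D_t = curve(D0, W, v, t)
```

Each atom moves along a great circle of the unit sphere by the angle $v_j t$, which is why the column norms are preserved for every $t$.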



Characterization of local minima on the oblique manifold. We can exploit the above parametrization
of the manifold $\mathcal{D}$ to characterize the existence of a local minimum as follows:

Proposition 1 (Local minimum characterization). Let $t > 0$ be some fixed scalar and define

$$\Delta F_n(W, v, t) \triangleq F_n(D(W, v, t)) - F_n(D_0). \qquad (4)$$

If we have

$$\inf_{W \in \mathcal{W}_{D_0},\, v \in \mathcal{S}^p_+} \Delta F_n(W, v, t) > 0,$$

then $F_n : \mathcal{D} \to \mathbb{R}_+$ admits a local minimum in $\{D \in \mathcal{D};\ \|D_0 - D\|_F < t\}$.

The detailed proof of this result is given in Sec. A of the appendix. It relies on the continuity of $F_n$ and
the fact that the curves $D(W, v, t)$ define a surjective mapping onto $\mathcal{D}$ (see Lemma 1 in the appendix). We
next describe some other ingredients required to state our results.



2.4 Closed-form expression for $F_n$?

Although the function $F_n$ is Lipschitz-continuous [Mairal et al., 2010], its minimization is challenging since
it is non-convex and subject to the non-linear constraints of $\mathcal{D}$. Moreover, $F_n$ is defined through the minimization
over the vectors $A$, which, at first sight, does not lead to a simple and convenient expression.
However, it is known that $F_n$ has a simple closed-form in some favorable scenarios.

Closed-form expression for $f_x$. We leverage here a key property of the function $f_x$. Denote by $\hat\alpha \in \mathbb{R}^p$ a
solution of problem (1), that is, the minimization defining $f_x$. By the convexity of the problem, there always
exists such a solution such that, denoting $J = \{j \in [\![1;p]\!];\ \hat\alpha_j \neq 0\}$ its support, the dictionary $D_J \in \mathbb{R}^{m \times |J|}$
restricted to the atoms indexed by $J$ has linearly independent columns (hence $D_J^\top D_J$ is invertible). Denoting
$\hat{s} \in \{-1,0,1\}^p$ the sign of $\hat\alpha$ and $J$ its support, $\hat\alpha$ has a closed-form expression in terms of $D_J$, $x$ and $\hat{s}$ (see,
e.g., Wainwright [2009], Fuchs [2005]). This property is appealing in that it makes it possible to obtain a
closed-form expression for $f_x$ (and hence, $F_n$), provided that we can control the sign patterns of $\hat\alpha$. In light
of this remark, it is natural to define:
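Definition 2. Let $s \in \{-1,0,1\}^p$ be an arbitrary sign vector and $J$ be its support. For $x \in \mathbb{R}^m$ and
$D \in \mathbb{R}^{m \times p}$, we define

$$\phi_x(D|s) \triangleq \inf_{\alpha \in \mathbb{R}^p,\ \mathrm{support}(\alpha) = J} \frac{1}{2}\|x - D\alpha\|_2^2 + \lambda s^\top \alpha.$$

Whenever $D_J^\top D_J$ is invertible, the minimum is achieved at $\hat\alpha = \hat\alpha(D, x, s)$ defined by

$$\hat\alpha_J = [D_J^\top D_J]^{-1}[D_J^\top x - \lambda s_J] \in \mathbb{R}^{|J|} \quad \text{and} \quad \hat\alpha_{J^c} = 0,$$

and we have

$$\phi_x(D|s) = \frac{1}{2}\left[\|x\|_2^2 - (D_J^\top x - \lambda s_J)^\top (D_J^\top D_J)^{-1} (D_J^\top x - \lambda s_J)\right]. \qquad (5)$$

Moreover, if $\mathrm{sign}(\hat\alpha) = s$, then

$$\phi_x(D|s) = \min_{\alpha \in \mathbb{R}^p,\ \mathrm{sign}(\alpha) = s} \frac{1}{2}\|x - D\alpha\|_2^2 + \lambda s^\top \alpha = \min_{\alpha \in \mathbb{R}^p,\ \mathrm{sign}(\alpha) = s} \mathcal{L}_x(D, \alpha) = \mathcal{L}_x(D, \hat\alpha).$$

We define $\Phi_n(D|S)$ analogously to $F_n(D)$, for a sign matrix $S \in \{-1,0,1\}^{p \times n}$.

Hence, with $\hat{s}$ the sign of the (unknown) minimizer $\hat\alpha$, we have $f_x(D) = \mathcal{L}_x(D, \hat\alpha) = \phi_x(D|\hat{s})$.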

Showing that the function $F_n$ is accurately approximated by $\Phi_n(\cdot|S)$ for a controlled $S$ will be a key
ingredient of our approach. This will exploit sign recovery properties of $\ell_1$-regularized least-squares problems,
a topic which is already well-understood (see, e.g., Wainwright [2009], Fuchs [2005] and references therein).
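
As a sanity check of the closed form (5), the following Python sketch computes $\hat\alpha_J = [D_J^\top D_J]^{-1}[D_J^\top x - \lambda s_J]$ and the value $\phi_x(D|s)$ on a small example, and compares it with a direct evaluation of $\frac{1}{2}\|x - D\alpha\|_2^2 + \lambda s^\top\alpha$ at $\hat\alpha$. The helper names, the small Gaussian-elimination solver and the toy dictionary are assumptions made for illustration only.

```python
def solve(G, b):
    """Solve the square system G z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(G, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def phi(x, D, s, lam):
    """Closed form (5) for phi_x(D|s); returns (value, alpha_hat). Assumes D_J^T D_J is invertible."""
    m, p = len(D), len(s)
    J = [j for j in range(p) if s[j] != 0]
    atoms = [[D[i][j] for i in range(m)] for j in J]                 # columns of D_J
    G = [[sum(ui * wi for ui, wi in zip(u, w)) for w in atoms] for u in atoms]   # D_J^T D_J
    rhs = [sum(ui * xi for ui, xi in zip(u, x)) - lam * s[j] for u, j in zip(atoms, J)]  # D_J^T x - lam s_J
    alpha_J = solve(G, rhs)
    value = 0.5 * (sum(xi * xi for xi in x) - sum(bi * ai for bi, ai in zip(rhs, alpha_J)))
    alpha_hat = [0.0] * p
    for j, a in zip(J, alpha_J):
        alpha_hat[j] = a
    return value, alpha_hat

def penalized_objective(x, D, alpha, s, lam):
    """0.5 * ||x - D alpha||_2^2 + lam * s^T alpha, the function minimized in Definition 2."""
    residual = [xi - sum(d * a for d, a in zip(row, alpha)) for xi, row in zip(x, D)]
    return 0.5 * sum(r * r for r in residual) + lam * sum(sj * a for sj, a in zip(s, alpha))

# Toy example (assumed values): three unit-norm atoms in dimension two, support J = {0, 2}.
D = [[1.0, 0.0, 0.6], [0.0, 1.0, 0.8]]
x, s, lam = [2.0, -1.0], [1, 0, -1], 0.1
value, alpha_hat = phi(x, D, s, lam)
```

Here the computed coefficients have signs matching $s$, so by Definition 2 the value also equals $\mathcal{L}_x(D, \hat\alpha)$.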



2.5 Coherence assumption on the reference dictionary $D_0$

We consider a standard sufficient support recovery condition referred to as the exact recovery condition in
signal processing [Fuchs, 2005, Tropp, 2003] or the irrepresentability condition (IC) in the machine learning
and statistics communities [Wainwright, 2009, Zhao and Yu, 2006]. It is a key element to control the supports
of the solutions of $\ell_1$-regularized least-squares problems. To keep our analysis reasonably simple, we will
impose the irrepresentability condition via a condition on the mutual coherence of the reference dictionary
$D_0$, which is a stronger requirement [Van de Geer and Bühlmann, 2009]. This quantity is defined (see,
e.g., Fuchs [2005], Donoho and Huo [2001]) as

$$\mu_0 \triangleq \max_{i,j \in [\![1;p]\!],\ i \neq j} |[d_0^i]^\top [d_0^j]| \in [0, 1].$$

The term $\mu_0$ gives a measure of the level of correlation between columns of $D_0$. It is for instance equal to
zero in the case of an orthogonal dictionary, and to one if $D_0$ contains two collinear columns. Similarly, we
introduce $\mu(W, v, t)$ for the dictionary $D(W, v, t)$ defined in (3). For any $W, v$, $t > 0$, we have the simple
inequality:

$$\mu(W, v, t) \triangleq \max_{i,j \in [\![1;p]\!],\ i \neq j} |[d^i(W, v, t)]^\top [d^j(W, v, t)]| \le \mu(t) \triangleq \mu_0 + 3t. \qquad (6)$$

In particular, we have $\mu(W, v, 0) = \mu_0$. For the theoretical analysis we conduct, we consider a deterministic
coherence-based assumption, as considered for instance in the previous work on dictionary learning
by Geng et al. [2011], such that the coherence $\mu_0$ and the level of sparsity $k$ of the coefficient vectors $\alpha^i$
should be inversely proportional, i.e., $k\mu_0 = O(1)$. In light of (6), such an upper bound on $\mu_0$ will loosely
transfer to $\mu(t)$ provided that $t$ is small enough. In fact, and as further developed in the appendix, most
of the elements of our proofs work based on a restricted isometry property (RIP), which is known to be
weaker than the coherence assumption [Van de Geer and Bühlmann, 2009]. However, since we still face a
problem related to IC when using RIP, we keep the coherence in our analysis. Unifying our proofs under a
RIP criterion is the object of future work.
2.6 Probabilistic model of sparse signals

Equipped with the main concepts, we now present our signal model. Given a fixed reference dictionary
$D_0 \in \mathcal{D}$, each noisy sparse signal $x \in \mathbb{R}^m$ is built independently from the following steps:

(1) Support generation: Draw uniformly without replacement $k$ atoms out of the $p$ available in $D_0$. This
procedure thus defines a support $J \triangleq \{j \in [\![1;p]\!];\ \delta(j) = 1\}$ whose size is $|J| = k$, and where $\delta(j)$ denotes the
indicator function equal to one if the $j$-th atom is selected, zero otherwise, so that

$$E[\delta(j)] = \frac{k}{p}, \quad \text{and for } i \neq j, \text{ we further have } E[\delta(j)\delta(i)] = \frac{k(k-1)}{p(p-1)}.$$

Our result holds for any support generation scheme yielding the above expectations.

(2) Coefficient generation: Define a sparse vector $\alpha_0 \in \mathbb{R}^p$ supported on $J$ whose entries in $J$ are generated
i.i.d. according to a sub-Gaussian distribution: for $j$ not in $J$, $[\alpha_0]_j$ is set to zero; on the other hand, we
assume there exists some $c > 0$ such that for $j \in J$ we have, for all $t \in \mathbb{R}$, $E\{\exp(t[\alpha_0]_j)\} \le \exp(c^2 t^2/2)$.
We denote $\sigma_\alpha$ the smallest value of $c$ such that this property holds. For background about sub-Gaussian
random variables, see, e.g., Buldygin and Kozachenko [2000]. For simplicity of the analysis we restrict to
the case where the distribution also has all its mass bounded away from zero. Formally, there exists $\underline{\alpha} > 0$
such that $\Pr(|[\alpha_0]_j| < \underline{\alpha} \mid j \in J) = 0$.

(3) Noise: Finally, generate the signal $x = D_0\alpha_0 + \varepsilon$, where the entries of the additive noise $\varepsilon \in \mathbb{R}^m$ are
assumed i.i.d. sub-Gaussian with parameter $\sigma$.
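
The three steps above can be sketched as a small sampler. This is only one admissible instance of the model, under assumptions that are ours: the non-zero coefficients get a random sign and a magnitude drawn uniformly in $[a_{\min}, a_{\max}]$ (bounded, hence sub-Gaussian, and bounded away from zero), and the noise is Gaussian with standard deviation `sigma`. The function and parameter names are hypothetical.

```python
import random

def sample_signal(D0, k, a_min, a_max, sigma, rng):
    """Draw one signal x = D0 alpha0 + eps; D0 is a list of rows, rng a random.Random instance."""
    m, p = len(D0), len(D0[0])
    # (1) Support generation: k atoms drawn uniformly without replacement.
    support = rng.sample(range(p), k)
    # (2) Coefficient generation: random sign, magnitude in [a_min, a_max] with a_min > 0.
    alpha0 = [0.0] * p
    for j in support:
        alpha0[j] = rng.choice((-1.0, 1.0)) * rng.uniform(a_min, a_max)
    # (3) Noise: i.i.d. Gaussian entries with standard deviation sigma.
    eps = [rng.gauss(0.0, sigma) for _ in range(m)]
    x = [sum(D0[i][j] * alpha0[j] for j in support) + eps[i] for i in range(m)]
    return x, alpha0, eps

# Toy usage (assumed values): identity dictionary in dimension 4, sparsity k = 2.
rng = random.Random(0)
D0 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
x, alpha0, eps = sample_signal(D0, k=2, a_min=0.1, a_max=10.0, sigma=0.05, rng=rng)
```

Passing an explicit `random.Random` instance keeps the draw reproducible, which is convenient when comparing runs over the same set of signals.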



3 Main results

This section describes the main results of this paper which show that under appropriate scalings of the
dimensions $(m, p)$, number of samples $n$, and model parameters $k, \underline{\alpha}, \sigma_\alpha, \sigma, \mu_0$, it is possible to prove that,
with high probability, the problem (2) admits a local minimum in a neighborhood of $D_0$ of controlled size,
for appropriate choices of the regularization parameter $\lambda$. The detailed proofs of the following results may
be found in the appendix, but we provide their main outlines in Sec. 4.

Theorem 1 (Local minimum of sparse coding). Let us consider our generative model of signals for some
reference dictionary $D_0 \in \mathbb{R}^{m \times p}$ with coherence $\mu_0$, and define $1/\gamma_{D_0} \triangleq |||D_0|||_2 \cdot k\mu_0$, where $|||D_0|||_2$ refers to
the spectral norm of $D_0$. If the following conditions hold:



(Coherence) $\Omega(\sqrt{\log(p)}) = \gamma_{D_0} = O(\sqrt{\log(n)})$,
(Sample complexity) = O' ^ 



m-p^ ■ 72 



then, with probability exceeding 1 — \^^^^\ ~ e ^V"^ problem (0j admits a local minimum in 

|d e P; ||Do - D||f = 0(^max|p • 70^ • e + \Jmp\og{n)/n , ^ • V^}) 

First, it is worth noting that this theorem is presented on purpose in a simplified form, in order to
highlight its message. In particular, all quantities related to the distribution of $\alpha_0$ (e.g., $\sigma_\alpha$) are assumed
to be $O(1)$ and are therefore kept "hidden" in the big-O notation. A detailed statement of this theorem is
however available in the appendix (see Theorem 3).

In words, the main message of Theorem 1 is that provided (a) the reference dictionary is incoherent
enough, and (b) we observe enough signals, we can guarantee the existence of a local minimum for problem (2)
in a ball centered at $D_0$. We can see that the radius of this ball decomposes according to three different
contributions: (1) the coherence of $D_0$, via the term $\gamma_{D_0}$, (2) the number of signals, and (3) the level of
noise. These three factors limit the possible resolution we can guarantee.

While a coherence condition scaling in $k\mu_0 = O(1)$ is standard for sparse models (see, e.g., Fuchs [2005]),
we impose a slightly more conservative constraint in $O(1/\sqrt{\log(p)})$. A typical example for which our result
applies is the Hadamard-Dirac dictionary built as the concatenation of a Hadamard matrix and the identity
matrix. In this case, we have $p = 2m$, $|||D_0|||_2 = \sqrt{2}$, and $\mu_0 = 1/\sqrt{m}$ with $k = O(\sqrt{m/\log(2m)})$. In
Sec. 5 we use such over-complete dictionaries for our simulations. In addition, observe that because of the
upper bound on $\gamma_{D_0}$, Theorem 1 does not handle per se the case of orthogonal dictionaries, which we remedy
in Theorem 2.
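
The quantities quoted for the Hadamard-Dirac dictionary can be verified on a small instance. The sketch below is an illustration with assumed helper names; it uses the Sylvester construction (so $m$ must be a power of two) with the Hadamard block scaled by $1/\sqrt{m}$ to obtain unit-norm atoms, then computes the mutual coherence and the Gram matrix of the rows.

```python
import math

def hadamard(m):
    """Sylvester construction of an m x m Hadamard matrix with +1/-1 entries (m a power of two)."""
    H = [[1]]
    while len(H) < m:
        H = [row + row for row in H] + [row + [-h for h in row] for row in H]
    return H

def hadamard_dirac(m):
    """m x 2m dictionary [H / sqrt(m), I] with unit-norm atoms."""
    H = hadamard(m)
    scale = 1.0 / math.sqrt(m)
    return [[h * scale for h in H[i]] + [1.0 if i == j else 0.0 for j in range(m)]
            for i in range(m)]

def coherence(D):
    """mu_0: largest absolute inner product between two distinct columns of D."""
    m, p = len(D), len(D[0])
    cols = [[D[i][j] for i in range(m)] for j in range(p)]
    return max(abs(sum(a * b for a, b in zip(cols[i], cols[j])))
               for i in range(p) for j in range(i + 1, p))

m = 8
D0 = hadamard_dirac(m)
mu0 = coherence(D0)
# Rows of D0 are orthogonal with squared norm 2, i.e. D0 D0^T = 2 I, hence |||D0|||_2 = sqrt(2).
gram_rows = [[sum(a * b for a, b in zip(r1, r2)) for r2 in D0] for r1 in D0]
```

Checking that $D_0 D_0^\top = 2I$ avoids computing a singular value decomposition: all singular values of $D_0$ then equal $\sqrt{2}$.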

Perhaps surprisingly (and disappointingly), our result indicates that, even in a low-noise setting with
sufficiently many signals (i.e., the asymptotic regime in $n$), we cannot arbitrarily lower the resolution of
the local minimum because of the coherence $\mu_0$. In fact, the term of the radius that is exponential in $\gamma_{D_0}$ is a direct consequence of
our proof technique which relies on exact recovery. It is however worth noting that, since this term decreases
exponentially fast in $\gamma_{D_0}$, the dependence on $\mu_0$ is quite mild (e.g., for a radius $\tau$, we have a constraint scaling
in $|||D_0|||_2 \cdot k\mu_0 = O(1/\sqrt{\log(1/\tau)})$). We next state a complementary theorem for orthogonal dictionaries
where the radius is not constrained anymore by the coherence:

Local correctness of sparse coding with orthogonal dictionaries: If we now assume that $D_0$ is
orthogonal (i.e., $\mu_0 = 0$ and $p = m$ with $|||D_0|||_2 = 1$), we obtain the following result:

Theorem 2 (Local minimum of sparse coding, orthogonal dictionary). Let us consider our generative
model of signals for some reference, orthogonal dictionary $D_0 \in \mathbb{R}^{m \times m}$. If we have:

log'(n)_^/ 1 



( Sample complexity ) = O 



n 



2 



minimum m 



then, with probability exceeding 1 — [-^^ij-^] ^ — e "^v"^ problem 0) admits a local 

jDel?; IIDo-DIIf = o( max jjTi • log(7i) • ( Vlog(n) + m)/V^, ^ ' |- 

Interestingly, we observe in this case that, given sufficiently many signals, we can localize arbitrarily well
(up to the noise level) the local minimum around $D_0$. We now discuss relations with previous work in the
noiseless setting.

Local correctness of sparse coding without noise: If we remove the noise from our signal model,
i.e., $\sigma = 0$, the results of Theorems 1 and 2 remain unchanged, except that the radius is not limited anymore
by $\frac{\sigma}{\sigma_\alpha}\sqrt{m}$. We mention that Gribonval and Schnass [2010] obtain a sample complexity in $O(p^2\log(p))$ in
the noiseless and square dictionary setting, while the result of Geng et al. [2011] leads to a scaling in $O(p^3)$
(assuming both $k = O(1)$ and $|||D_0|||_2 = O(1)$) in the noiseless, over-complete case. In comparison, our
analysis suggests a sample complexity in $O(mp^3)$.

These discrepancies are due to the fact that we want to handle the noisy setting; this has led us to consider
a scheme of proof radically different from those proposed in the related work [Gribonval and Schnass, 2010,
Geng et al., 2011]. In particular, our formulation in problem (2) differs from that of Gribonval and Schnass
[2010], Geng et al. [2011] where the $\ell_1$-norm of $A$ is minimized over the equality constraint $DA = X$ and the
dictionary normalization $D \in \mathcal{D}$. Optimality is then characterized through the linearization of the equality
constraint, a technique that could not be easily extended to the noisy case. We next discuss the main
building blocks of the results and give a high-level structure of the proof.



4 Architecture of the proof of Theorem 1

Our proof strategy consists in using Proposition 1, that is, controlling the sign of $\Delta F_n(W, v, t)$ defined in (4).
In fact, since we expect to have for many training samples the equality $f_x(D(W, v, t)) = \phi_x(D(W, v, t)|\mathrm{sign}(\alpha_0))$
uniformly for all $(W, v)$, the main idea is to first concentrate on the study of the smooth function

$$\Delta\Phi_n(W, v, t) \triangleq \Phi_n(D(W, v, t)|\mathrm{sign}(A_0)) - \Phi_n(D_0|\mathrm{sign}(A_0)), \qquad (7)$$

instead of the original function $\Delta F_n(W, v, t)$.
Control of $\Delta\Phi_n$: This first step consists in uniformly lower bounding $\Delta\Phi_n$ with high probability. As opposed
to $\Delta F_n$, the function $\Delta\Phi_n$ is available explicitly, see (2) and (18), and corresponds to bilinear/quadratic
forms in $(\alpha_0, \mathrm{sign}(\alpha_0), \varepsilon)$ which we can concentrate around their expectations. Finally, the uniformity with
respect to $(W, v)$ is obtained by a standard $\epsilon$-net argument.



Control of $\Delta F_n$ via $\Delta\Phi_n$: The second step consists in lower bounding $\Delta F_n$ in terms of $\Delta\Phi_n$ uniformly for
all parameters $(W, v) \in \mathcal{W}_{D_0} \times \mathcal{S}^p_+$. For a given $t > 0$, consider the independent events $\{\mathcal{E}^i_{\mathrm{coincide}}(t)\}_{i \in [\![1;n]\!]}$
defined by

$$\mathcal{E}^i_{\mathrm{coincide}}(t) \triangleq \{\omega;\ f_{x^i(\omega)}(D(W, v, t)) = \phi_{x^i(\omega)}(D(W, v, t)|s_0),\ \forall (W, v) \in \mathcal{W}_{D_0} \times \mathcal{S}^p_+\},$$

with $s_0 = \mathrm{sign}(\alpha_0)$. In words, the event $\mathcal{E}^i_{\mathrm{coincide}}(t)$ corresponds to the fact that the target function $f_{x^i(\omega)}(D(\cdot, \cdot, t))$
coincides with the idealized one $\phi_{x^i(\omega)}(D(\cdot, \cdot, t)|s_0)$ for the "radius" $t$.

Importantly, the event $\mathcal{E}^i_{\mathrm{coincide}}(t)$ only involves a single signal; when we consider a collection of $n$ independent
signals, we should instead study the event $\bigcap_{i=1}^n \mathcal{E}^i_{\mathrm{coincide}}(t)$ to guarantee that $\Phi_n$ and $F_n$ (and
therefore, $\Delta\Phi_n$ and $\Delta F_n$) do coincide. However, as the number of observations $n$ becomes large, it is unrealistic
to ensure exact recovery both simultaneously for the $n$ signals and with high probability.
To get around this issue, we seek to prove that $\Delta F_n$ is well approximated by $\Delta\Phi_n$ (rather than equal to it)
uniformly for all $(W, v)$. We show that, when $f_{x^i}(D(t))$ and $\phi_{x^i}(D(t)|s_0)$ do not coincide, their difference
can be bounded, and we obtain:

$$\Delta F_n(W, v, t) \ge \Delta\Phi_n(W, v, t) - r_n,$$
where we detail the definition of the residual term 

1 " 

2=1 

In the appendix, we show that with high probability: r„ = 0{[t^ ■ + 2m ■ + 2\kaa\ ■ (3 — log k)h) with 
K = maxig[i.„| Pr([£^oincidc(0 ^ 'S'coincido(O)] To bound the size of r„, we now control k. 



^coincide (^) — \^ 



Control of $\kappa$, exact sign recovery for perturbed dictionaries: We need to determine sufficient conditions
under which $\phi_x(D(W, v, t)|\mathrm{sign}(\alpha_0))$ and $f_x(D(W, v, t))$ coincide for all $(W, v)$, and control the
probability of this event. As briefly exposed in Sec. 2.4, it turns out that this question comes down to
studying exact recovery for some $\ell_1$-regularized least-squares problems. Exact sign recovery in the problem
associated with $f_x(D_0)$ has already been well-studied (see, e.g., Wainwright [2009], Fuchs [2005], Zhao and Yu
[2006]). However, in our context, we need the same conclusion to hold not only at the dictionary $D_0$, but
also at $D(W, v, t) \neq D_0$ uniformly for all parameters $(W, v)$. It turns out that going away from the reference
dictionary $D_0$ acts as a second source of noise whose variance depends on the radius $t$. We make this
statement precise in Propositions 2-3 in the supplementary material. These results are in the same line as
Theorem 1 in Mehta and Gray [2012].



Discussing when the lower bound on $\Delta F_n$ is positive: With all the previous elements in place, we
have a lower bound for $\inf_{W \in \mathcal{W}_{D_0},\, v \in \mathcal{S}^p_+} \Delta F_n(W, v, t)$, valid with high probability. It finally suffices to discuss
when it is strictly positive to conclude with Proposition 1.



5 Experiments 

We illustrate the results from Sec. 3. Although we do not manage to highlight the exact scalings in $(p, m)$
which we proved in Theorems 1 and 2, our experiments still underline the main interesting trends put forward
by our results, such as the dependencies with respect to $n$ and $\sigma$.
Throughout this section, the non-zero coefficients of $\alpha_0$ are uniformly drawn with $|[\alpha_0]_j| \in [0.1, 10]$ and
the noise follows a zero-mean Gaussian distribution with variance $\sigma$. We detail two important aspects of the
experiments, namely, the choice of $\lambda$, and how we deal with the invariance of problem (2) (see Sec. 2.2).
Since our analysis relies on exact recovery, we first tune $\lambda$ over a logarithmic grid to match the oracle sparsity
level. Note that this tuning step is performed over an auxiliary set of signals. On the other hand, we know
that the dictionary $D$ that we learn by minimizing problem (2) may differ from $D_0$ up to sign flips and atom
permutations. Since both $D$ and $D_0$ have normalized atoms, finding the closest dictionary (in Frobenius
norm) up to these transformations is equivalent to an assignment problem based on the absolute correlation
matrix $D^\top D_0$, which can be efficiently solved using the Hungarian algorithm [Kuhn, 1955].
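
The matching step can be illustrated as follows. For unit-norm atoms, $\|d_0^i - s\,d^j\|_2^2 = 2 - 2s\,[d_0^i]^\top d^j$, so minimizing the Frobenius error over permutations and sign flips amounts to maximizing the sum of absolute correlations along the assignment. The sketch below solves this assignment by brute force over all permutations, as a stand-in for the Hungarian algorithm that is only practical for very small $p$; the function name and the toy matrices are assumptions of this illustration.

```python
import math
from itertools import permutations

def align_error(D, D0):
    """Smallest ||D0 - D P S||_F over atom permutations P and sign flips S (brute force).

    Assumes unit-norm atoms; D and D0 are stored as lists of rows with the same shape.
    """
    m, p = len(D0), len(D0[0])
    cols = [[D[i][j] for i in range(m)] for j in range(p)]
    cols0 = [[D0[i][j] for i in range(m)] for j in range(p)]
    # C[i][j] = <d0^i, d^j>: correlation between reference atom i and learned atom j.
    C = [[sum(a * b for a, b in zip(c0, c)) for c in cols] for c0 in cols0]
    # Assignment maximizing the total absolute correlation (p! candidates, small p only).
    best = max(permutations(range(p)),
               key=lambda perm: sum(abs(C[i][perm[i]]) for i in range(p)))
    signs = [1.0 if C[i][best[i]] >= 0 else -1.0 for i in range(p)]
    sq_err = sum((cols0[i][r] - signs[i] * cols[best[i]][r]) ** 2
                 for i in range(p) for r in range(m))
    return math.sqrt(sq_err), best, signs

# Toy example (assumed values): D equals the identity up to a permutation and two sign flips.
D0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
D = [[0.0, 1.0, 0.0], [0.0, 0.0, -1.0], [-1.0, 0.0, 0.0]]
err, perm, signs = align_error(D, D0)
```

For realistic dictionary sizes, the same correlation matrix would be passed to a polynomial-time assignment solver instead of enumerating permutations.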



To solve problem (2), we use the stochastic algorithm from Mairal et al. [2010]^1 where the batch size is
fixed to 512, while the number of epochs is chosen so as to pass over each signal 25 times (on average). We
consider two types of initialization, i.e., either from (1) a random dictionary, or (2) the correct $D_0$.

To begin with, we illustrate Theorem 1 with $D_0$ a Hadamard-Dirac (over-complete) dictionary. The
sparsity level is fixed such that $|||D_0|||_2 \cdot k\mu_0 = O(1/\sqrt{\log(p)})$, and we consider a small enough noise level, so
that the radius is primarily limited by the number n of signals. The normalized error ||Do — D||f/\/top'^ 
versus $n$ is plotted in Fig. 1. We then focus on Theorem 2 with $D_0$ a Hadamard (orthogonal) dictionary.
We consider sufficiently many signals ($n = 75{,}000$) so that the radius is only limited by $\sqrt{m} \cdot \sigma/\sigma_\alpha$. The
normalized error $\|D_0 - D\|_F/\sqrt{m}$ versus the level of noise is displayed in Fig. 1.




Figure 1: Normalized error between $D_0$ and the solution of problem (2), versus the number of signals (left)
and the noise level (right). The curves represent the median error based on 5 runs, for random and oracle 
initializations. More details can be found in the text; best seen in color. 



The curves represented in Fig. 1 are not perfectly superimposed, thus implying that our results do not
capture the exact scalings in $(p, m)$ (our bounds appear in fact as too pessimistic). However, our theory seems
to account for the main dependencies with respect to $n$ and $\sigma$, as shown by the good agreement with the predicted
slopes. Interestingly, while we would expect the curves in the left plot of Fig. 1 to tail off at some
point because of the coherence (through the term of the radius bound that is exponential in $\gamma_{D_0}$), it seems that there is in practice
a much milder dependency with respect to the coherence. Finally, we can observe that both the random
and oracle initializations seem to lead to the same behavior, thus raising the question of the potential global
characterization of these local minima.

^1 The code is available at http://www.di.ens.fr/willow/SPAMS/.
6 Conclusion

We have conducted a non-asymptotic analysis of the local minima of sparse coding in the presence of noise,
thus extending prior work which focused on noiseless settings [Gribonval and Schnass, 2010, Geng et al.,
2011]. Within a probabilistic model of sparse signals, we have shown that a local minimum exists with high
probability around the reference dictionary.

Our study can be further developed in multiple ways. On the one hand, while we have assumed deterministic
coherence-based conditions scaling in $O(1/k)$, it may be interesting to consider non-deterministic
assumptions [Candès and Plan, 2009], which are likely to lead to improved scalings. On the other hand, we
may also use more realistic generative models for $\alpha_0$, for instance, spike and slab models [Ishwaran and Rao,
2005], or signals with compressible priors [Gribonval et al., 2011].



Also, we believe that our approach can handle the presence of outliers, provided their total energy remains 
small enough; we plan to make this argument formal in future work. 

Finally, it remains challenging to extend our local properties to global ones due to the intrinsic
non-convexity of the problem; an appropriate use of convex relaxation techniques [Bach et al., 2008] may prove
useful in this context.
A Detailed Statements of the Main Results

We gather in this appendix the detailed statements and the proofs of the simplified results presented in the
core of the paper. In particular, we show in this section that under appropriate scalings of the problem
dimensions $(m, p)$, number of training samples $n$, and model parameters $k, \underline{\alpha}, \sigma_\alpha, \sigma, \mu_0$, it is possible to prove
that, with high probability, the problem of sparse coding admits a local minimum in a certain neighborhood
of $D_0$ of controlled size, for appropriate choices of the regularization parameter $\lambda$.



A.1 Local minimum of sparse coding

We present here a complete and detailed version of our result upon which the theorems presented in the 
paper are built. 

Theorem 3 (Local minimum of sparse coding). Let us consider our generative model of signals for some
reference dictionary $D_0 \in \mathbb{R}^{m \times p}$ with coherence $\mu_0$. Introduce the parameters $q_\alpha \triangleq \frac{E[\alpha^2]}{\sigma_\alpha^2}$ and $Q_\alpha \triangleq \frac{E[\alpha^2]}{\sigma_\alpha E[|\alpha|]}$
which depend on the distribution of $\alpha_0$ only. Consider the following quantities:

a 1 Qa 



T = r Do, ao) = mm<^ — , — ■ \ 

L(Ta 3co fc Do 2-* 



7 = 7(n, Do,Q:o) = \ min \^J2 log(n) , — J- ^° 



2 |v 6v ^'2^coc^ III Do III 2 -/^Moj' 

and let us define the radius $t \in \mathbb{R}_+$ by



t = max 



< --p-^ci^^e ^ + 2c2 • 7 • mp >, — ■ ^Jm> 

I qa ^ I n i ) ffa ) 



for some universal constants c* . Provided the following conditions are satisfied: 

1 Qa 



(Coherence) IIID0III2 ' ^"Mo < 



4V2coc^ ^/log(69cic2 • ^ • f ) ^ 



log(n) q"^ 1 
(Sample complexity) — - — < — • ■ —, 



C3 m ■ p^ 7^ 
one can find a regularization parameter $\lambda$ proportional to $\gamma \cdot \sigma_\alpha \cdot t$, and with probability exceeding
$1 - \left(\frac{mpn}{9}\right)^{-mp/2} - e^{-4\sqrt{n}}$, there exists a local minimum in $\{D \in \mathcal{D};\ \|D_0 - D\|_F < t\}$.



As will be discussed at greater length in Sec. B, the probability of success of Theorem 3
can be decomposed into the contributions of the concentration of the surrogate function and the residual
term. We next present a second result which assumes a more constrained signal model:

Theorem 4 (Local minimum of sparse coding with noiseless/bounded signals). Let us consider our generative
model of signals for some reference dictionary $D_0 \in \mathbb{R}^{m \times p}$ with coherence $\mu_0$. Further assume that
$\alpha_0$ is almost surely upper bounded by $\overline{\alpha}$ and that there is no noise, that is, $\sigma = 0$. Introduce the parameters



9" ~ a-E[k 



and Qo 



which depend on the distribution of ag only. Consider the radius t £ K+ ; 

1/2 



t^ 



SClCA 



kmp' 



3 log(") 



for some universal constants . Provided the following conditions are satisfied: 

1 



( Coherence ) 



logf/i) 

(Sample complexity) < 



D0III2 -fc^/'Mo < 
1 



k'^mp^ 



cqCx 



qa 



ZiCl 



a 



1 



Qa 



(Ta'Sco /C- III Do III 2 



one can find a regularization parameter $\lambda$ proportional to $\sqrt{k} \cdot \overline{\alpha} \cdot t$, and with probability exceeding
$1 - \left(\frac{mpn}{9}\right)^{-mp/2}$, there exists a local minimum in $\{D \in \mathcal{D};\ \|D_0 - D\|_F < t\}$.

These two theorems, which are proved in Section B, heavily rely on the following central result.



A.2 The backbone of the analysis

We concentrate on the result which constitutes the backbone of our analysis. Indeed, we next show how the
difference

$\Delta F_n(W, v, t) \triangleq F_n(D(W, v, t)) - F_n(D_0)$ (8)

is lower bounded with high probability and uniformly with respect to all possible choices of the parameters
$(W, v)$. The theorem and corollaries displayed in the core of the paper are consequences of this general
theorem, discussing under which conditions/scalings this lower bound can be proved to be sufficient (i.e.,
strictly positive) to exploit Proposition 1 and conclude that a local minimum exists for $t$ appropriately
chosen. We define



$Q_t \triangleq \dfrac{1}{\sqrt{1 - k\mu(t)}}$ and $C_t \triangleq \dfrac{1}{\sqrt{1 - \delta_k(D_0)} - t}$, (9)

where the quantity $\delta_k(D_0)$ is the RIP constant, itself defined in Section C.



Theorem 5. Let $\underline{\alpha}, \sigma_\alpha$ be the parameters of the coefficient model. Consider $D_0$ a dictionary in $\mathbb{R}^{m \times p}$ with
$\mu_0 < 1/2$ and let $k, t > 0$ be such that



kH.{t) 
3t 



< 1/2 
4a 
^ 9^ 



(10) 
(11) 
Then for small enough noise levels $\sigma$ one can find a regularization parameter $\lambda > 0$ such that

4 



2 - ^ " - - 9 

Given $\sigma$ and $\lambda$ satisfying (12), we define



2 v/iV2 + TOa2 < A < -a. (12) 



A ^i^-Qt) > ^2l^. (13) 

Let $x^i \in \mathbb{R}^m$, $i \in [1;n]$, where $n/\log n \ge mp$, be generated according to the signal model. Then, except with
probability at most $\left(\frac{mpn}{9}\right)^{-mp/2} + \exp(-4n \cdot e^{-\gamma^2})$, we have

inf AF„(W,v,i) > (l-^2).E[a| fc^^2 

WGWdo.vGSp 2 P 

/ 16 \ fc 

'Qt [jQt + 3j • E{|ao|} • i • - • IIID0III2 • k^t) ■ A 

-B.f^, (14) 
V n 

w/iere 

/C ^ a- (|||Do|||2- v^ + t) (15) 

A ^ 367 • (tV^ _|_ 2mcr2 + 2Afccr„) (16) 

B = 3045 {kal ■ t + 2ma^ + 2Afccr„) . (17) 

Roughly speaking, the lower bound we obtain can be decomposed into three terms: (1) the expected value
of our surrogate function, valid uniformly for all parameters $(v, W)$, (2) the contributions of the residual term
(discussed in the next section) introducing the quantity $\gamma$, and (3) the probabilistic concentrations over the
$n$ signals of the surrogate function and the residual term.

The proof of the theorem and its main building blocks are detailed in the next section. 

B Architecture of the proof of Theorem 5

Since we expect to have for many training samples the equality $f_x(D(W, v, t)) = \phi_x(D(W, v, t)|\mathrm{sign}(\alpha_0))$
uniformly for all $(W, v)$, the main idea is to first concentrate on the study of the smooth function

$\Delta\Phi_n(W, v, t) \triangleq \Phi_n(D(W, v, t)|\mathrm{sign}(A_0)) - \Phi_n(D_0|\mathrm{sign}(A_0))$, (18)

instead of the original function $\Delta F_n(W, v, t)$.

B.1 Control of $\Delta\Phi_n$

The first step consists in uniformly lower bounding $\Delta\Phi_n$ with high probability.
Proposition 2. Assume that $k\mu(t) < 1/2$; then for any $n$ such that

$\dfrac{n}{\log n} \ge mp$, (19)

except with probability at most $\left(\frac{mpn}{9}\right)^{-mp/2}$, we have



inf A$„(W,v,t) > (l-IC').Mi.^.e 



Z p 

-Ql-t-y IIID0III2 • fc/^W • A • {AQlX + 3E{|ao|}) 



■ \ mp 



logn 



(20) 



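where

$\mathcal{K} \triangleq C_t \cdot (|\!|\!|D_0|\!|\!|_2 \cdot \sqrt{k/p} + t)$,
$B \triangleq 3045\, (k\sigma_\alpha^2 \cdot t + 2m\sigma^2 + \lambda k \sigma_\alpha + \lambda^2 k \cdot t)$.

The proof of this proposition is given in Section E.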



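B.2 Control of $\Delta F_n$ in terms of $\Delta\Phi_n$

The second step consists in lower bounding $\Delta F_n$ in terms of $\Delta\Phi_n$ uniformly for all $(W, v) \in \mathcal{W}_{D_0} \times \mathcal{S}^p$. For
a given $t \ge 0$, consider the independent events $\{\mathcal{E}^i_{\mathrm{coincide}}(t)\}_{i \in [1;n]}$ defined by

$\mathcal{E}^i_{\mathrm{coincide}}(t) \triangleq \{\omega;\ f_{x^i(\omega)}(D(W, v, t)) = \phi_{x^i(\omega)}(D(W, v, t)|s_0),\ \forall (W, v) \in \mathcal{W}_{D_0} \times \mathcal{S}^p\}$,

with $s_0 = \mathrm{sign}(\alpha_0)$. In words, the event $\mathcal{E}^i_{\mathrm{coincide}}(t)$ corresponds to the fact that the target function $f_{x^i(\omega)}(D(\cdot, \cdot, t))$
coincides with the idealized one $\phi_{x^i(\omega)}(D(\cdot, \cdot, t)|s_0)$ for the "radius" $t$.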

Importantly, the event $\mathcal{E}^i_{\mathrm{coincide}}(t)$ only involves a single signal; when we consider a collection of $n$ independent
signals, we should instead study the event $\bigcap_{i=1}^n \mathcal{E}^i_{\mathrm{coincide}}(t)$ to guarantee that $\Phi_n$ and $F_n$ (and therefore,
$\Delta\Phi_n$ and $\Delta F_n$) do coincide. However, as the number of observations $n$ becomes large, it is unrealistic
to ensure exact recovery simultaneously for the $n$ signals and with high probability.

To get around this issue, we will relax our expectations and seek to prove that $\Delta F_n$ is well approximated
by $\Delta\Phi_n$ (rather than equal to it) uniformly for all $(W, v)$. This will be achieved by showing that, when
$f_{x^i}(D(t))$ and $\phi_{x^i}(D(t)|s_0)$ do not coincide, their difference can be bounded. For any $D \in \mathbb{R}^{m \times p}$, we have
by the very definition (1), $0 \le f_x(D) \le \mathcal{L}_x(D, \alpha_0)$. We have as well by the definition

$0 \le \phi_x(D|\mathrm{sign}(\alpha_0)) \le \min_{\alpha \in \mathbb{R}^p,\ \mathrm{sign}(\alpha) = \mathrm{sign}(\alpha_0)} \frac{1}{2}\|x - D\alpha\|_2^2 + \lambda \cdot \mathrm{sign}(\alpha_0)^\top \alpha \le \mathcal{L}_x(D, \alpha_0)$.

It follows that for all $(W, v) \in \mathcal{W}_{D_0} \times \mathcal{S}^p$ we have, with $D = D(W, v, t)$,

$\phi_x(D_0|s_0) - \phi_x(D|s_0) + f_x(D) - f_x(D_0) \ge -\phi_x(D|s_0) - f_x(D_0) \ge -\{\mathcal{L}_x(D, \alpha_0) + \mathcal{L}_x(D_0, \alpha_0)\}$.

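When both functions coincide uniformly at radius $t$ (the event $\mathcal{E}_{\mathrm{coincide}}(t)$ holds) and at radius zero ($\phi_x(D_0|s_0) = f_x(D_0)$,
i.e., the event $\mathcal{E}_{\mathrm{coincide}}(0)$ holds), the left hand side is indeed zero. As a result we have, uniformly
for all $(W, v) \in \mathcal{W}_{D_0} \times \mathcal{S}^p$:

$f_{x^i}(D) - f_{x^i}(D_0) \ge \phi_{x^i}(D|s_0) - \phi_{x^i}(D_0|s_0) - r_{x^i}$,

with $r_{x^i} \triangleq 1_{[\mathcal{E}^i_{\mathrm{coincide}}(t) \cap \mathcal{E}^i_{\mathrm{coincide}}(0)]^c}(\omega) \cdot \{\mathcal{L}_{x^i}(D, \alpha_0^i) + \mathcal{L}_{x^i}(D_0, \alpha_0^i)\}$.

Averaging the above inequality over a set of $n$ signals, we obtain a similar uniform lower bound for $\Delta F_n$:

$\Delta F_n(W, v, t) \ge \Delta\Phi_n(W, v, t) - r_n$. (21)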



where we detail the definition

$r_n(\omega) \triangleq \dfrac{1}{n} \sum_{i=1}^n 1_{[\mathcal{E}^i_{\mathrm{coincide}}(t) \cap \mathcal{E}^i_{\mathrm{coincide}}(0)]^c}(\omega) \cdot \{\mathcal{L}_{x^i}(D, \alpha_0^i) + \mathcal{L}_{x^i}(D_0, \alpha_0^i)\}$. (22)



Using Lemma |23I and CoroUaryHlin the Appendix, one can show that with high probabihty: 
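$r_n \le 25\, (t^2 \sigma_\alpha^2 + 2m\sigma^2 + 2\lambda k \sigma_\alpha)(1 + \log 2) \cdot (3 - \log \kappa)\kappa$,

with $\kappa = \max_{i \in [1;n]} \Pr([\mathcal{E}^i_{\mathrm{coincide}}(t) \cap \mathcal{E}^i_{\mathrm{coincide}}(0)]^c)$. To bound the size of the residual $r_n$, we now control $\kappa$.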



B.2.1 Control of $\kappa$: exact sign recovery for perturbed dictionaries

The objective of this section is to determine sufficient conditions under which $\phi_x(D(W, v, t)|\mathrm{sign}(\alpha_0))$ and
$f_x(D(W, v, t))$ coincide for all $(W, v)$, and control the probability of this event. We make this statement
precise in the following proposition, proved in Appendix F.

Proposition 3 (Exact recovery for perturbed dictionaries and one training sample). Consider $D_0$ a dictionary
in $\mathbb{R}^{m \times p}$ and let $k, t$ be such that $k\mu(t) < 1/2$. Let $\underline{\alpha}, \sigma_\alpha, \sigma$ be the remaining parameters of our signal
model, and let $x \in \mathbb{R}^m$ be generated according to this model. Assume that the regularization parameter $\lambda$
satisfies



We also need a modified version of this proposition to handle a simplified, noiseless setting where the
coefficients $\alpha_0$ are almost surely upper bounded. Its proof can be found in Section F as well.

Proposition 4 (Exact recovery for perturbed dictionaries and one training sample; noiseless and bounded
$\alpha_0$). Consider $D_0$ a dictionary in $\mathbb{R}^{m \times p}$ and let $k, t$ be such that $k\mu(t) < 1/2$. Consider our signal model with
the following additional assumptions:



Let $x \in \mathbb{R}^m$ be generated according to this model. Assume that the regularization parameter $\lambda$ satisfies



Consider <t' <t. Almost surely, we have, uniformly for all (W,v) S Wdo ^ , the vector a{t') € M.P 
defined by 



< A < -Q. 
- 9- 

Consider < t' < t. Except with probability at most 





$\sigma = 0$ (no noise)

$\Pr(|[\alpha_0]_j| > \overline{\alpha} \mid j \in J) = 0$, for some $\overline{\alpha} \ge \underline{\alpha} > 0$ (signal boundedness).
B.2.2 Control of the residual

The last step of the proof of Theorem 5 consists in controlling the residual term (22). Its proof can be found
m Section El 

Proposition 5. Let $\underline{\alpha}, \sigma_\alpha$ be the parameters of the coefficient model. Consider $D_0$ a dictionary in $\mathbb{R}^{m \times p}$
with $\mu_0 < 1/2$ and let $k, t$ be such that

$k\mu(t) < 1/2$ (23)
3t 4q 

2^ < 9^ ^^^> 
Then for small enough noise levels $\sigma$ one can find a regularization parameter $\lambda > 0$ such that

4 



Given $\sigma$ and $\lambda$ satisfying (25), we define

A(2 - Q2) 



• V^^af+W2 < A < -a. (25) 



7= 7*^— >y^. (26) 

V5 • V< CT^ + TOO- 

Let $x^i \in \mathbb{R}^m$, $i \in [1;n]$, be generated according to our noisy signal model. Then,

Tn < {fcrl + 27710-2 + 2AA:cr„) • 367 • 7^ • e-^\ (27) 
2 

except with probability at most $\exp(-4n \cdot e^{-\gamma^2})$.

We have stated the main results and showed how they are structured in key propositions, which we now 
prove. 



A Proof of Proposition 1 

The topology we consider on $\mathcal{D}$ is the one induced by its natural embedding in $\mathbb{R}^{m \times p}$: the open sets are
the intersections of open sets of $\mathbb{R}^{m \times p}$ with $\mathcal{D}$. Recall that all norms are equivalent on $\mathbb{R}^{m \times p}$ and induce
the same topology. For convenience we will consider the balls associated to the Frobenius norm. To prove
the existence of a local minimum for $F_n$, say at $D^*$, we will show the existence of a ball centered at $D^*$,
$\mathcal{B}_h = \{D \in \mathcal{D};\ \|D^* - D\|_F \le h\}$, such that for any $D \in \mathcal{B}_h$, we have $F_n(D^*) \le F_n(D)$.



First step: We recall the notation $\mathcal{S}^p_+ \triangleq \mathcal{S}^p \cap \mathbb{R}^p_+$ for the sphere intersected with the positive orthant.
Moreover, we introduce

$\mathcal{Z}_t \triangleq \{D(W, v, t');\ W \in \mathcal{W}_{D_0},\ v \in \mathcal{S}^p_+,\ t' \in [0, t],\ \text{and}\ t'\|v\|_\infty \le \pi\}$.

The set $\mathcal{Z}_t$ is compact as the image of a compact set by the continuous function $(W, v, t') \mapsto D(W, v, t')$. As
a result, the continuous function $F_n$ admits a global minimum in $\mathcal{Z}_t$ which we denote by $D^* = D(W^*, v^*, t^*)$.
Moreover, according to the assumption of Proposition 1, we have $t^* < t$.



Second step: We will now prove the existence of $h > 0$ such that $\mathcal{B}_h \subseteq \mathcal{Z}_t$. This will imply that
$F_n(D^*) \le F_n(D)$ for $D \in \mathcal{B}_h$, hence that $D^*$ is a local minimum. First, we formalize the following lemma.
Lemma 1. Given any matrix $D_1 \in \mathcal{D}$, any matrix $D_2 \in \mathcal{D}$ can be described as $D_2 = D(D_1, W, v, \tau)$, with
$W \in \mathcal{W}_{D_1}$, $v \in \mathcal{S}^p_+$ and $\tau \ge 0$ such that $\tau\|v\|_\infty \le \pi$. Moreover, we have

$\frac{2}{\pi}\,\tau v_j \le \|d_2^j - d_1^j\|_2 = 2\sin\left(\frac{\tau v_j}{2}\right) \le \tau v_j, \quad \forall j$, (28)

$\frac{2}{\pi}\,\tau \le \|D_2 - D_1\|_F \le \tau$. (29)

Vice-versa, $D_1 = D(D_2, W', v', \tau')$ for some $W' \in \mathcal{W}_{D_2}$, with the same $v' = v \in \mathcal{S}^p_+$, $\tau' = \tau \ge 0$.

Proof. The result is trivial if $D_2 = D_1$, hence we focus on the case $D_2 \ne D_1$. Each column of $D_2$ can be
uniquely expressed as

$d_2^j = u + z$, with $u \in \mathrm{span}(d_1^j)$ and $u^\top z = 0$.

Since $\|d_2^j\|_2 = 1$, the previous relation can be rewritten as

$d_2^j = \cos(\theta_j)\, d_1^j + \sin(\theta_j)\, w^j$,

for some $\theta_j \in [0, \pi]$ and some unit vector $w^j$ orthogonal to $d_1^j$ (except for the case $\theta_j = 0$, the vector $w^j$ is
unique). The sign indetermination in $w^j$ is handled thanks to the convention $\sin(\theta_j) \ge 0$. One can define a
matrix $W \in \mathcal{W}_{D_1}$ whose $j$-th column is $w^j$. Denote $\theta = (\theta_1, \dots, \theta_p)$ and $\tau = \|\theta\|_2$. Since $D_2 \ne D_1$ we have
$\tau > 0$ and we can define $v \in \mathcal{S}^p_+$ with coordinates

$v_j \triangleq \theta_j / \tau$.

Next we notice that $\tau\|v\|_\infty = \|\theta\|_\infty \le \pi$ and

$\|d_2^j - d_1^j\|_2^2 = \|(\cos(v_j\tau) - 1)\, d_1^j + \sin(v_j\tau)\, w^j\|_2^2 = (1 - \cos(v_j\tau))^2 + \sin^2(v_j\tau)$
$= 2(1 - \cos(v_j\tau)) = 4\sin^2(v_j\tau/2)$.

We conclude using the inequalities $\frac{2}{\pi} \le \frac{\sin u}{u} \le 1$ for $0 \le u \le \pi/2$ and the fact that $\|v\|_2 = 1$. The
reciprocal $D_1 = D(D_2, W', v', \tau')$ is obvious, and the fact that $v' = v$, $\tau' = \tau$ follows from the equality
$\|d_1^j - d_2^j\|_2 = 2\sin(v_j\tau/2) = 2\sin(v'_j\tau'/2)$ for all $j$. □
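The parameterization used in Lemma 1 can be checked numerically. The following minimal sketch assumes that $D(D_1, W, v, \tau)$ has columns $\cos(v_j\tau)\, d_1^j + \sin(v_j\tau)\, w^j$, as in the proof above, and tests (28) and (29) on a random instance. The helper names (`unit`, `orthogonal_unit`, `perturb`, `check_lemma1`) are hypothetical.

```python
import math
import random

def unit(vec):
    """Rescale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def orthogonal_unit(d, rng):
    """Random unit vector orthogonal to the unit vector d."""
    g = [rng.gauss(0.0, 1.0) for _ in d]
    proj = sum(gi * di for gi, di in zip(g, d))
    return unit([gi - proj * di for gi, di in zip(g, d)])

def perturb(D1, W, v, tau):
    """Columns cos(v_j * tau) * d_1^j + sin(v_j * tau) * w^j (one list per column)."""
    return [
        [math.cos(vj * tau) * d + math.sin(vj * tau) * w for d, w in zip(dj, wj)]
        for dj, wj, vj in zip(D1, W, v)
    ]

def check_lemma1(m=5, p=7, tau=1.3, seed=0, tol=1e-9):
    """Return (columnwise bounds (28) hold, Frobenius bounds (29) hold)."""
    rng = random.Random(seed)
    D1 = [unit([rng.gauss(0.0, 1.0) for _ in range(m)]) for _ in range(p)]
    W = [orthogonal_unit(d, rng) for d in D1]
    v = unit([abs(rng.gauss(0.0, 1.0)) for _ in range(p)])  # point of the positive orthant
    if tau * max(v) > math.pi:
        raise ValueError("need tau * ||v||_inf <= pi")
    D2 = perturb(D1, W, v, tau)
    dists = [math.dist(a, b) for a, b in zip(D1, D2)]
    cols_ok = all(
        abs(c - 2.0 * math.sin(tau * vj / 2.0)) < tol
        and 2.0 / math.pi * tau * vj - tol <= c <= tau * vj + tol
        for c, vj in zip(dists, v)
    )
    frob = math.sqrt(sum(c * c for c in dists))
    frob_ok = 2.0 / math.pi * tau - tol <= frob <= tau + tol
    return cols_ok, frob_ok

if __name__ == "__main__":
    print(check_lemma1())
```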

Using the parameterization built in Lemma 1 for $D \in \mathcal{B}_h$, there remains to prove that $D = D(D_0, W, v, \tau)$
belongs to $\mathcal{Z}_t$ provided that $h$ is small enough. For that, we need to show that $\tau < t$. To this end, notice that

$\|D^* - D\|_F^2 = \sum_{j=1}^p \|(\cos(v_j^* t^*) - \cos(v_j\tau))\, d_0^j + \sin(v_j^* t^*)\, w^{*j} - \sin(v_j\tau)\, w^j\|_2^2$

$= 2\sum_{j=1}^p \left(1 - \cos(v_j^* t^*)\cos(v_j\tau) - \sin(v_j^* t^*)\sin(v_j\tau)\,[w^j]^\top w^{*j}\right)$,

where the simplifications in the second equality come from the fact that both $W$ and $W^*$ have their columns
normalized and orthogonal to the corresponding columns of $D_0$. Since $t^*\|v^*\|_\infty \le \pi$ and $\tau\|v\|_\infty \le \pi$, the
product of sine terms is nonnegative, so that with $|[w^j]^\top w^{*j}| \le 1$, we obtain

$\|D^* - D\|_F^2 \ge 2\sum_{j=1}^p \left(1 - \cos(v_j^* t^*)\cos(v_j\tau) - \sin(v_j^* t^*)\sin(v_j\tau)\right) = 2\sum_{j=1}^p (1 - \cos(\Delta_j)) = 4\sum_{j=1}^p \sin^2(\Delta_j/2)$,

where $\Delta_j = v_j^* t^* - v_j\tau$. Now, since $0 \le t^* v_j^*,\ \tau v_j \le \pi$, we have $\Delta_j/2 \in [-\pi/2, \pi/2]$, hence using that
$\sin^2(u) \ge \frac{4}{\pi^2} u^2$ for $|u| \le \frac{\pi}{2}$, we finally have

$h^2 \ge \|D^* - D\|_F^2 \ge \frac{4}{\pi^2}\sum_{j=1}^p \Delta_j^2 = \frac{4}{\pi^2}\sum_{j=1}^p \left([v_j^* t^*]^2 + [v_j\tau]^2 - 2\tau t^* v_j v_j^*\right) \ge \frac{4}{\pi^2}(\tau - t^*)^2$,

where we have exploited that both $v^*$ and $v$ are normalized, so that $\sum_j v_j v_j^* \le 1$. As a consequence, we have $\tau \le t^* + \frac{\pi}{2}h$, hence
for $h < \frac{2}{\pi}(t - t^*)$ we guarantee $\tau < t$, so that $D \in \mathcal{Z}_t$. We conclude that $\mathcal{B}_h \subseteq \mathcal{Z}_t$ for $h < \frac{2}{\pi}(t - t^*)$.

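Third and last step: To recapitulate, we have shown that there exists a ball $\mathcal{B}_h$ in $\mathcal{D}$, such that $\mathcal{B}_h \subseteq \mathcal{Z}_t$
and for any $D \in \mathcal{B}_h$, we have

$F_n(D) \ge F_n(D^*)$,

since the previous inequality is true over the entire set $\mathcal{Z}_t$. We can finally observe using Lemma 1 that

$\|D^* - D_0\|_F^2 = \sum_{j=1}^p \|d^{*j} - d_0^j\|_2^2 \le \sum_{j=1}^p (t^* v_j^*)^2 = (t^*)^2 < t^2$,

which leads to the advertised conclusion.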

B Proof of Theorems 3 and 4

We start with the more general theorem:

B.1 Proof of Theorem 3

We recall that we assume in Theorem [S] that cx ■ t < and for small enough noise levels a one can find a 
regularization parameter A > such that 



Cx ■ ^/t'^(T'^+ ma'^ < A < -a. 



Given such a and A, we define 

A 



7^ - >72toi(2). 



Here, cx and = "^"^^ stand for some universal constants which can be made explicit thanks to Theorem[5] 

Goal: To determine when the lower bound proved in Theorem 5 is strictly positive, it is sufficient to consider
when it holds that

E[a']---t^ ~ coA-E[|«|].-.t.|||Do|||2-Mi) 

P P 

- C2{tkal + 2m(T^ + 2Xkal) ■ A„ with A„ ^ 
> i • (-02^^ + ait - flo) > 0, 

for some universal constants $c_j$ which we can make explicit based on Theorem 5, but which we keep hidden
for clarity.



log(n) 
mp 
Probability of success: The probability of success of Theorem 5 is given by

$1 - \left(\frac{mpn}{9}\right)^{-mp/2} - \exp(-4n \cdot e^{-\gamma^2})$.

This induces a first condition over $\gamma$ (an upper bound), namely

$n e^{-\gamma^2} \ge e_n \iff \gamma^2 \le \log(n) - \log(e_n)$, for some $e_n \to \infty$.

From now on, we make the choice $e_n = \sqrt{n}$, so that $\exp(-4n e^{-\gamma^2}) \le \exp(-4\sqrt{n})$, along with the condition

$\gamma^2 \le \frac{1}{2}\log(n)$. (30)
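To get a feel for the orders of magnitude involved, the short sketch below evaluates the success probability under the choice $e_n = \sqrt{n}$. It assumes that the probability reads $1 - (mpn/9)^{-mp/2} - \exp(-4\sqrt{n})$, which is our reading of the expression above; the function names are hypothetical.

```python
import math

def success_probability(m, p, n):
    """Assumed lower bound 1 - (mpn/9)^(-mp/2) - exp(-4 sqrt(n)) on the success probability."""
    concentration = (m * p * n / 9.0) ** (-m * p / 2.0)  # surrogate-function term
    residual = math.exp(-4.0 * math.sqrt(n))             # residual term with e_n = sqrt(n)
    return 1.0 - concentration - residual

def gamma_cap(n):
    """Largest gamma allowed by condition (30): gamma^2 <= log(n) / 2."""
    return math.sqrt(0.5 * math.log(n))

if __name__ == "__main__":
    print(success_probability(8, 16, 75000))
    g = gamma_cap(10 ** 4)
    # With gamma at its cap, n * exp(-gamma^2) equals sqrt(n).
    print(10 ** 4 * math.exp(-g * g))
```

Both failure terms are negligible as soon as $mp$ and $n$ are moderately large, which is consistent with the "high probability" statements of the theorems.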



Noiseless/low-noise regime: Even though they are conceptually two different regimes, the treatment of
the noisy and noiseless regimes follows the very same reasoning. From now on, we therefore assume that

$m\sigma^2 \le t^2\sigma_\alpha^2$, (31)

which determines the upper level of noise we will be able to handle.

Second-order polynomial function in t: By simply using t and 3+2V2cj < AV2c^, 

we now make explicit the a^, j G {0, 1,2}, which define the second-order polynomial function in t: 

a2 = 3V2coc^-aaE[|a|]---|||Do|||2-A:-7 

P 

fli = E[a'^] ■ - - \/2cqc^ ■ aaV\\a\] ■ - ■ IDolj ■ ^A^o ■ 7 - Scicr^ • 7^6"'*' 

ao = 2\/2c^ • fccr^ . [^^^3g-7" _^2c2 • 7- An]- 
We will make use of the following simple lemma to discuss the sign of this polynomial function:
Lemma 2. Let $(a_0, a_1, a_2) \in \mathbb{R}_+^3$. If $4a_0a_2 < a_1^2$, and $t \in \left[\frac{2a_0}{a_1}, \frac{a_1}{2a_2}\right]$, then $-a_2t^2 + a_1t - a_0 > 0$.
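Lemma 2 is elementary but easy to get wrong by a constant, so a quick numerical check may help. The sketch below assumes that the interval of the lemma reads $[2a_0/a_1,\ a_1/(2a_2)]$ and that the three coefficients are strictly positive; the function names are hypothetical.

```python
def poly(t, a0, a1, a2):
    """Second-order polynomial -a2 * t^2 + a1 * t - a0."""
    return -a2 * t * t + a1 * t - a0

def admissible_interval(a0, a1, a2):
    """Assumed interval [2 a0 / a1, a1 / (2 a2)] on which the polynomial is positive."""
    if min(a0, a1, a2) <= 0:
        raise ValueError("coefficients are assumed strictly positive")
    if 4 * a0 * a2 >= a1 * a1:
        raise ValueError("need 4 * a0 * a2 < a1 ** 2")
    return 2 * a0 / a1, a1 / (2 * a2)

def positive_on_interval(a0, a1, a2, steps=1000):
    """Check positivity on a regular grid of the admissible interval."""
    lo, hi = admissible_interval(a0, a1, a2)
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return all(poly(t, a0, a1, a2) > 0 for t in grid)

if __name__ == "__main__":
    print(admissible_interval(1.0, 5.0, 2.0), positive_on_interval(1.0, 5.0, 2.0))
```

At the left endpoint the polynomial equals $a_0(1 - 4a_0a_2/a_1^2)$ and at the right endpoint it equals $a_1^2/(4a_2) - a_0$, both positive under $4a_0a_2 < a_1^2$; concavity then gives positivity in between.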

Some key definitions: Let 6 be defined as 

a 1 Q„ 



(Tq'Sco fc-|||Do|||2 
We also define 7inin > 1 the unique number such that 



and 



Moreover, we consider 



^mine""°"" = ^^r^---^' (32) 



1 



v/2toiH,-^=— ■ . (33) 

A„,max = -ioq" 2 ' 2 ' ^' (^4) 

138c2C^ p ■ T 
First step, non-emptiness of $[\gamma_{\min}, \gamma_{\max}]$: We first check that the interval $[\gamma_{\min}, \gamma_{\max}]$ is not empty. On
the one hand, if the value of $\gamma_{\max}$ is obtained by $\sqrt{1/2 \log(n)}$, we use the fact that $\gamma_{\min} \le \gamma_{\max}$ is equivalent
to $\gamma_{\min}^2 e^{-\gamma_{\min}^2} \ge \gamma_{\max}^2 e^{-\gamma_{\max}^2}$. In particular, we have

a condition that will be implied by the more stringent condition A„ < A„_max- 

On the other hand, and in the second scenario for 7max, we conclude based on the following lemma: 



Lemma 3. Let $a \ge 1$ and $b \in (0, 1/5]$. If $a^2e^{-a^2} = b$, then $\sqrt{\log(1/b)} \le a \le 2\sqrt{\log(1/b)}$.
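The two-sided bound of Lemma 3 can be verified numerically by solving $a^2e^{-a^2} = b$ with a bisection. This is only a sanity check under our reading of the lemma (lower bound $\sqrt{\log(1/b)}$, upper bound $2\sqrt{\log(1/b)}$); `solve_a` is a hypothetical helper.

```python
import math

def solve_a(b):
    """Root a >= 1 of a^2 * exp(-a^2) = b for 0 < b <= 1/5, found by bisection."""
    if not 0.0 < b <= 0.2:
        raise ValueError("b must lie in (0, 1/5]")

    def f(a):
        return a * a * math.exp(-a * a) - b

    lo, hi = 1.0, 2.0       # f(1) = 1/e - b > 0 and f is decreasing on [1, +inf)
    while f(hi) > 0.0:      # enlarge the bracket until the sign changes
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    for b in (0.2, 1e-2, 1e-6):
        a = solve_a(b)
        root = math.sqrt(math.log(1.0 / b))
        print(b, root, a, 2.0 * root)
```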
The sufficient condition which stems from this lemma reads

1 Qa 



4V2coc^ /log (69cic2 • -L . £) 



■ III Do III 2 < 



Second step, lower bound on oi: For any 7 S [7min , 7max] , it is first easy to check that 

k 1 fc 

\2cqc^ ■ craE[|a|] • — • |||Do|||n ■ fc^o ■ 7 < t^I^I"^^] ' ~- 
p A p 

Moreover, since 7^6"'''^ < ^'^e~^^ and j^Qa ■ ^ > 7minS~''^°""i therefore obtain that 

ai > E[a2] . - - 1 . E[a2] ■ E[a'^] ■->- ■ E[a^] • -. 

p A p A p 2 p 

Third step, the condition 4aoa2 < a\: Since we have ai > i • E[q;2] • |, and 

02 < 4-\/2c-y • fccr^ • max |ci7^e"''' , 2c2 • 7 • A„|, 
simple computations show that 7 > 7,„in and A„ < A„ max, as defined in and ([M)) . lead to 4aoa2 < a\. 

Conclusions: We have proved that for 7 e [7min , 7max] , A„ < A„^niax, and 

1 Qa 



k^iQ ■ III Do III 2 < 



4^coc^ y^log (69cic2 . ^ . 1) 



the lower bound provided by Theorem 5 is strictly positive for a radius $t \in \left[\frac{2a_0}{a_1}, \frac{a_1}{2a_2}\right]$ (see Lemma 2) and a
noise $\sigma \le \sigma_\alpha t/\sqrt{m}$. Taking the smallest allowed radius (i.e., $t = \frac{2a_0}{a_1}$ with $\gamma = \gamma_{\max}$) leads to the displayed
result.



B.2 Proof of Theorem 4

We now discuss the version of Theorem 3 in the simpler setting where there is no noise (i.e., $\sigma = 0$) and
$\alpha_0$ is almost surely bounded by $\overline{\alpha} \ge \underline{\alpha} > 0$. The main consequence of these simplifying assumptions is that
there is no residual term to consider anymore and our surrogate function coincides almost surely with the
true sparse coding function, provided the radius $t$ is small enough, as proved in Proposition 4. As a result,
the terms depending on $\gamma$ in Theorem 5 disappear, and the probability of success simplifies to

$1 - \left(\frac{mpn}{9}\right)^{-mp/2}$.

Moreover, in light of Proposition |4l we now ask for 

-c\\fkat < A < -a. 
3 - - 9- 

The backbone of the proof remains identical; we only adapt the discussion about the polynomial function in $t$.
Goal: To determine when the lower bound proved in Theorem 5 is strictly positive, it is sufficient to consider
when it holds that



log(n) 
mp 



k k 
E\a^]---t^ - coA-E[|a|] • - -i- |||Do||U -fc^W 
P P 

- ci{tkal + 2\kal) ■ A„ with A„ = 

> i • (-02^^ + ait - ao) > 0, 

for some universal constants $c_j$ which we can make explicit based on Theorem 5, but which we keep hidden
for clarity.

Second-order polynomial function in t: By making the choice A = -a-y/k-t, wc now make explicit 
the ttj, j G {0, 1, 2}, which define the second-order polynomial function in t: 

a2 = \c^c^-a-nM-y\\D4^-k^'^ 

ai = ¥.[a']---]-CQCx-a-¥.[\a\]---\\D4^-e'^^ia 
p 2 p 

ao = 2cic\ ■ k^^^aaCi ■ A„. 
Conclusions: Consider the condition 

|||Do|||2-fc^/Vo<— 

Coca a ■ CTq 

so that oi > i • E[a^] • |. By using again Lemmas and by defining 



A - 



1 E[a2] 1 . ( a 1 
9cic? kp \ ctq ' 5co /c • |||Do 



2 



it is easy to check that A„ < An.max implies that 4aoa2 < af along with 

'^ai 9cxa y/k' 
as required by our choice of A and the fact that A < |a. 

C Uniform restricted isometry and coherence properties 

First, we introduce $P_J(t) \in \mathbb{R}^{m \times m}$ the orthogonal projector which projects onto the span of $[D(t)]_J$ and
establish a result that holds without any assumption on $D_0$.

Lemma 4. For any $W \in \mathcal{W}_{D_0}$, $v \in \mathcal{S}^p$, $t \ge 0$ and $J$,

$|\!|\!|[D(t) - D_0]_J|\!|\!|_2^2 \le \|[D(t) - D_0]_J\|_F^2 \le t^2 \cdot \|v_J\|_2^2$ (35)
$|\!|\!|(I - P_J(t))[D_0]_J|\!|\!|_2^2 \le \|(I - P_J(t))[D_0]_J\|_F^2 \le t^2 \cdot \|v_J\|_2^2$ (36)

Proof. For the first result we observe

$\|[D(t) - D_0]_J\|_F^2 = \sum_{j \in J} \|d^j(t) - d_0^j\|_2^2 = 4\sum_{j \in J} \sin^2(v_jt/2) \le 4\sum_{j \in J} (v_jt/2)^2 = t^2 \cdot \|v_J\|_2^2$.

For the second one, using Lemma 1 with $D_1 = D(t) = D(D_0, W, v, t)$, $D_2 = D_0$, there exists $W' \in \mathcal{W}_{D(t)}$
such that for each $j$, $d_0^j = d^j(t)\cos(v_jt) + w'^j\sin(v_jt)$. Hence, denoting $C = \mathrm{Diag}(\cos(v_jt))$ and $S =
\mathrm{Diag}(\sin(v_jt))$ we have $[D_0]_J = [D(t)C]_J + [W'S]_J$. Each column of $[D(t)C]_J$ belongs to the span of the
columns of $[D(t)]_J$, so that

$(I - P_J(t))[D_0]_J = (I - P_J(t))[W'S]_J$. (37)

As a result,

$\|(I - P_J(t))[D_0]_J\|_F^2 = \|(I - P_J(t))[W'S]_J\|_F^2 \le \|[W'S]_J\|_F^2 = \sum_{j \in J} \sin^2(v_jt) \le \|v_J\|_2^2 \cdot t^2$.

□


Next, we control the norms of $\Theta_J(t') = [D_J^\top(t')D_J(t')]^{-1}$ when this is a well-defined matrix. For that,
we first recall the definition of the restricted isometry constant of order $k$ of a dictionary $D$, $\delta_k(D)$, as the
smallest number $\delta_k$ such that for any support set $J$ of size $|J| = k$ and $z \in \mathbb{R}^k$,

$(1 - \delta_k)\|z\|_2^2 \le \|D_Jz\|_2^2 \le (1 + \delta_k)\|z\|_2^2$. (38)



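Lemma 5. Let $D_0 \in \mathbb{R}^{m \times p}$ be a dictionary and $k$ such that $\delta_k(D_0) < 1$. For any $t < \sqrt{1 - \delta_k(D_0)}$ define

$C_t \triangleq \dfrac{1}{\sqrt{1 - \delta_k(D_0)} - t}$. (39)

For any $W \in \mathcal{W}_{D_0}$, $v \in \mathcal{S}^p$, $0 \le t' \le t$ and $J$ of size $k$, the $J \times J$ matrix

$\Theta_J(t') \triangleq [D_J^\top(t')D_J(t')]^{-1}$ (40)

is well defined and we have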

|||Dj(t')||l2 = ll|Dj(t')lll2 < (41) 
\lQjit')\h < Ct (42) 
|||Dj(t')0j(t')lll2 < C-*- (43) 

Proof. By the triangle inequality and Lemma 4-Equation (35), for any $J$ of size $k$ and $z \in \mathbb{R}^k$ we have

$\|D_J(t')z\|_2 \ge \|[D_0]_Jz\|_2 - \|[D(t') - D_0]_Jz\|_2 \ge (\sqrt{1 - \delta_k(D_0)} - t'\|v_J\|_2) \cdot \|z\|_2 \ge (\sqrt{1 - \delta_k(D_0)} - t) \cdot \|z\|_2$
$\|D_J(t')z\|_2 \le \|[D_0]_Jz\|_2 + \|[D(t') - D_0]_Jz\|_2 \le (\sqrt{1 + \delta_k(D_0)} + t'\|v_J\|_2) \cdot \|z\|_2 \le (\sqrt{1 + \delta_k(D_0)} + t) \cdot \|z\|_2$.

Hence, in the sense of symmetric positive definite matrices

$(\sqrt{1 - \delta_k(D_0)} - t)^2 \cdot I \preceq D_J^\top(t')D_J(t') \preceq (\sqrt{1 + \delta_k(D_0)} + t)^2 \cdot I$.

As a result, $D_J^\top(t')D_J(t')$ is invertible so $\Theta_J(t')$ is indeed well defined, and

1 



|||Dj(OI|l2 = \\Ty]{t')\\^^Jl-D-l{t')T>,{t')\\^<^l + 5u{T>o)+t< — ^ ^ ^ 

|||0j(t')lll2 = |||(Dj(t')Dj(^0)-^|||, < ] 

(vl - 4(Do) - t) 



|||Dj(t')0j(t')lll2 = Vl||0j(t')Dj(t')Dj(t')ej(i')lll2 = \/lll®j(^')lll2 < 



v/i-4(Do)-t 

□ 
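To continue, we control certain norms of the dictionary when it has low coherence:

Lemma 6. Let $D \in \mathbb{R}^{m \times p}$ be a dictionary with coherence $\mu$ and normalized columns (i.e., with unit
$\ell_2$-norm). For any $J \subseteq [1;p]$ with $|J| \le k$, we have

$|\!|\!|D_J^\top D_J - I|\!|\!|_2 \le \|D_J^\top D_J - I\|_F \le k\mu$,

along with

$|\!|\!|D_J^\top D_J|\!|\!|_2 = |\!|\!|D_JD_J^\top|\!|\!|_2 \le 1 + k\mu$ and $\delta_k(D) \le k\mu$.

Similarly, it holds

$|\!|\!|D_J^\top D_J|\!|\!|_\infty \le 1 + k\mu$ and $|\!|\!|D_{J^c}^\top D_J|\!|\!|_\infty \le k\mu$.

Moreover, introduce for any $A \in \mathbb{R}^{k \times k}$ the matrix norm

$N(A) \triangleq k \cdot \max_{i,j} |A_{ij}|$

and consider

$\Theta_J \triangleq [D_J^\top D_J]^{-1}$.

If we further assume $k\mu < 1$, then $\Theta_J$ is well-defined and

$\max\{|\!|\!|\Theta_J - I|\!|\!|_\infty,\ |\!|\!|\Theta_J - I|\!|\!|_2,\ \|\Theta_J - I\|_F,\ N(\Theta_J - I)\} \le \dfrac{k\mu}{1 - k\mu}$,

along with

$\max\{|\!|\!|\Theta_J|\!|\!|_\infty,\ |\!|\!|\Theta_J|\!|\!|_2\} \le \dfrac{1}{1 - k\mu}$.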

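Proof. These properties are already well-known [see, e.g., Tropp, 2004; Fuchs, 2005]. We briefly prove them.
First, we introduce $H = D_J^\top D_J - I$. A straightforward elementwise upper bound leads to

$|\!|\!|H|\!|\!|_2^2 \le \|H\|_F^2 = \sum_{i \in J}\sum_{j \in J \setminus \{i\}} ([d^i]^\top d^j)^2 \le \mu^2 \cdot |J|^2 \le k^2\mu^2$.

This proves that in the sense of positive definite matrices, $(1 - k\mu)I \preceq D_J^\top D_J \preceq (1 + k\mu)I$, which shows in
turn the bound on $\delta_k(D)$. Moreover, since $|\!|\!|I|\!|\!|_2 = 1$ with $|\!|\!|A^\top A|\!|\!|_2 = |\!|\!|AA^\top|\!|\!|_2$ for any matrix $A$, we
have

$|\!|\!|D_J^\top D_J|\!|\!|_2 = |\!|\!|D_JD_J^\top|\!|\!|_2 \le 1 + k\mu$.

By definition of $|\!|\!|\cdot|\!|\!|_\infty$ we also have

$|\!|\!|D_J^\top D_J|\!|\!|_\infty \le 1 + |\!|\!|H|\!|\!|_\infty = 1 + \max_{i \in J}\sum_{j \in J \setminus \{i\}} |[d^i]^\top d^j| \le 1 + k\mu$.

Note that for $|\!|\!|D_{J^c}^\top D_J|\!|\!|_\infty$, there are no diagonal terms to take into account.

Now, if $k\mu < 1$ holds, then we have $\max\{|\!|\!|H|\!|\!|_\infty, |\!|\!|H|\!|\!|_2, \|H\|_F, N(H)\} \le k\mu < 1$ and there are convergent
series expansions of $[I + H]^{-1}$ in each of these norms [Horn and Johnson, 1990]. By sub-multiplicativity, we
obtain

$\|\Theta_J - I\| = \left\|\sum_{t=1}^\infty (-1)^tH^t\right\| \le \sum_{t=1}^\infty \|H\|^t \le \dfrac{k\mu}{1 - k\mu}$,

where $\|\cdot\|$ stands for one of the four aforementioned matrix norms. The last result lies in the fact that for the
norms $|\!|\!|\cdot|\!|\!|_\infty$ and $|\!|\!|\cdot|\!|\!|_2$, we have $|\!|\!|I|\!|\!| = 1$ and

$|\!|\!|\Theta_J|\!|\!| \le |\!|\!|\Theta_J - I|\!|\!| + |\!|\!|I|\!|\!| \le 1 + k\mu/(1 - k\mu) = 1/(1 - k\mu)$.

□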
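The series argument used in the proof can be reproduced on a small example. The sketch below is an illustration only: it assumes a Hadamard-Dirac dictionary $[I, H/\sqrt{m}]$ with $m = 64$ (coherence $1/8$), picks a support of size $k = 4$ so that $k\mu = 1/2$, computes $\Theta_J$ through the truncated Neumann series $\sum_t (-H)^t$, and compares the $\ell_\infty$-operator, Frobenius and $N$ norms with the bounds of Lemma 6. All function names and the chosen support are hypothetical.

```python
import math

def column(j, m):
    """Column j of the assumed dictionary [I, H / sqrt(m)], with m a power of two."""
    if j < m:
        return [1.0 if i == j else 0.0 for i in range(m)]
    # Sylvester-Hadamard entry: (-1) ** popcount(i & column index).
    return [(-1) ** bin(i & (j - m)).count("1") / math.sqrt(m) for i in range(m)]

def matmul(A, B):
    return [[sum(A[i][l] * B[l][j] for l in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def inf_norm(A):
    return max(sum(abs(x) for x in row) for row in A)

def frob_norm(A):
    return math.sqrt(sum(x * x for row in A for x in row))

def n_norm(A):
    return len(A) * max(abs(x) for row in A for x in row)

def lemma6_report(m=64, J=(0, 5, 64, 70), terms=60):
    k, mu = len(J), 1.0 / math.sqrt(m)  # mu: coherence of the whole assumed dictionary
    cols = [column(j, m) for j in J]
    H = [[0.0 if a == b else sum(x * y for x, y in zip(cols[a], cols[b]))
          for b in range(k)] for a in range(k)]  # H = D_J^T D_J - I
    eye = [[1.0 if a == b else 0.0 for b in range(k)] for a in range(k)]
    minus_H = [[-x for x in row] for row in H]
    theta, power = [row[:] for row in eye], [row[:] for row in eye]
    for _ in range(terms):  # Theta_J = sum_{t >= 0} (-H)^t
        power = matmul(power, minus_H)
        theta = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(theta, power)]
    E = [[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(theta, eye)]
    gram = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(eye, H)]
    check = matmul(theta, gram)
    inverse_error = max(abs(check[a][b] - eye[a][b]) for a in range(k) for b in range(k))
    return {
        "k_mu": k * mu, "bound": k * mu / (1.0 - k * mu),
        "inf_H": inf_norm(H), "frob_H": frob_norm(H),
        "inf_E": inf_norm(E), "frob_E": frob_norm(E), "N_E": n_norm(E),
        "inf_Theta": inf_norm(theta), "inverse_error": inverse_error,
    }

if __name__ == "__main__":
    for name, value in lemma6_report().items():
        print(name, value)
```

On this instance the bounds are far from tight, which is in line with the remark made in the experiments that the dependency on the coherence appears milder in practice.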
We now derive a simple corollary which will be useful for the computation of expectations:

Corollary 1. Let $D \in \mathbb{R}^{m \times p}$ be a dictionary with normalized columns and coherence $\mu$. With the notation
from Lemma 6, if $k\mu < 1$, we have for any $a \in \{1, 2\}$ and for any $J \subseteq [1;p]$ with $|J| \le k$,

$\max_{i \ne j} |[\Theta_J^a]_{ij}| \le \dfrac{a\mu}{(1 - k\mu)^a}$.
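Proof. We first make use of Lemma 6 which gives

$N(\Theta_J - I) = k \cdot \max_{i,j} |[\Theta_J - I]_{ij}| \le \dfrac{k\mu}{1 - k\mu}$,

which notably implies that

$\max_{i \ne j} |[\Theta_J]_{ij}| \le \dfrac{\mu}{1 - k\mu}$.

We continue by noticing that $[\Theta_J - I]^2 = \Theta_J^2 - I + 2(I - \Theta_J)$ and by sub-multiplicativity of $N$

$N([\Theta_J - I]^2) \le [N(\Theta_J - I)]^2 \le \dfrac{(k\mu)^2}{(1 - k\mu)^2}$.

Applying the triangle inequality, we obtain

$N(\Theta_J^2 - I) \le 2N(\Theta_J - I) + N([\Theta_J - I]^2) \le \dfrac{2k\mu(1 - k\mu) + (k\mu)^2}{(1 - k\mu)^2} = \dfrac{2k\mu - (k\mu)^2}{(1 - k\mu)^2} \le \dfrac{2k\mu}{(1 - k\mu)^2}$.

As a result, we finally get

$\max_{i \ne j} |[\Theta_J^2]_{ij}| \le \dfrac{2\mu}{(1 - k\mu)^2}$,

hence the advertised conclusion. □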

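Corollary 2. Let $D_0 \in \mathbb{R}^{m \times p}$ be a dictionary with normalized columns. If $k\mu(t) < 1/2$ then, for any
$W \in \mathcal{W}_{D_0}$, $v \in \mathcal{S}^p$ and $0 \le t' \le t$ we have

$|\!|\!|[D_{J^c}^\top(t')D_J(t')][D_J^\top(t')D_J(t')]^{-1}|\!|\!|_\infty \le \dfrac{k\mu(t)}{1 - k\mu(t)} = k\mu(t)\,Q_t^2 = Q_t^2 - 1 < 1$, (44)

where we introduce

$Q_t \triangleq \dfrac{1}{\sqrt{1 - k\mu(t)}} \ge C_t$.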

D Expectation over J 

Lemma 7. Let Dp G be any dictionary and J a random support. Denoting by S{i) = llj(i) i/ie 

indicator function of J, we assume that for all i j G [l;p]| 

nsm = - 

T/ien we /laue for any v G 5'' and < i' < 

E{||[Do]J[Do]j-I||^} = |lDjDo-ie-^fc|| (45) 

E{||vj||i} ^ - (46) 



E{||Dj(0Dj(0-I||K-||v,,i|2} < (||DjDo-I|l.-W^l •- + 2.C,.i.-. (47) 
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Proof. To obtain (pS)) and (|46p we simply expand 

A:(fc - 1) 



'eli;pl j6li;pljyi ieli;plieli;p]j5^* ^ 

E{||v,,||^} = e{ E '5«-vn= E -v? = --||v!|^ = -. 

L J ^-^ p p p 



Now, by Lemma 5 and the Cauchy-Schwarz inequality for random variables

E{||Dj(t')Dj(i') - I||p • ||vj||2} < E{||[Do]J[Do]j IW.h} + 2-Cft- E{||v,,||2} 



< ^E{|1 [Do]J [Do],, - I^} . ^E{||vj||i} + 2-Cft-- 



□ 



E Proof of Proposition 2

In this section, we establish the results required to lower bound $\Delta\Phi_n(W, v, t)$. We denote

$\Delta\phi_{x^i}(W, v, t) \triangleq \phi_{x^i}(D(W, v, t)|s_0^i) - \phi_{x^i}(D_0|s_0^i)$. (48)

The overall approach consists of the following steps:
1. Concentration around the expectation:

Lemma 8. Under our signal model, for any W £ VVdo ■ v £ 5'', r £ [0, \/n\, we have 

PrfA$„(W,v,t) <E{A0x(W,v,t)}-c(t)^) <2.cxp(-r2) (49) 



with 

c{t) = 102 • [t^al + 2m.a^ + 2Afccr„) (50) 

2. Control of the Lipschitz constant: the second step consists in showing that $(W, v) \mapsto \Delta\Phi_n(W, v, t)$
is Lipschitz with controlled constant with respect to the metric

$d((W, v), (W', v')) \triangleq \max\left\{\max_{j \in [1;p]} \|w^j - w'^j\|_2,\ \|v - v'\|_2\right\}$. (51)



Lemma 9. Assume that t < ^1 — (5/j(Do). Under our signal model we have for any r £ [0, 
except with probability at most 2exp(— r^); for all (W,v) and (W',v') 

|A$„(W,v,i)-A$„(W',v',<)| <L - + •d((W,v),(W',v')). 

where 

L^SOCf ■t-[5{kal+ma^) + X'^k] (52) 

3. $\epsilon$-net argument: combining Lemmata 8-9 together with an estimate of the size of an $\epsilon$-net of $\mathcal{W}_{D_0} \times \mathcal{S}^p$
with respect to the considered metric, we obtain
Lemma 10. Assume that $t < \sqrt{1 - \delta_k(D_0)}$ and that $C_t \le 1.5$. Under our signal model, and assuming
that

$\dfrac{n}{\log n} \ge mp$,

we have, except with probability at most $\left(\frac{mpn}{9}\right)^{-mp/2}$,

$\inf_{W, v} \Delta\Phi_n(W, v, t) \ge \inf_{W, v} E\{\Delta\phi_x(W, v, t)\} - B \cdot \sqrt{mp \cdot \dfrac{\log n}{n}}$,

with

$B = 3045\, (k\sigma_\alpha^2 \cdot t + 2m\sigma^2 + \lambda k \sigma_\alpha + \lambda^2 k \cdot t)$.

4. Control in expectation: 

Lemma 11. Assume that k^{t) < 1/2. Under our signal model, we have 



(53) 



inf E{A(?!)x(W,v,t)} > (l-ZC^ 



t 



2 'p ■ 



|||Do|||2-fc/^W-A-(4Q?A + 3E{|ao|}) 



withlC = Ct ■ (IIID0III2 • ^/kjp + t). 
We obtain Proposition 2 by combining Lemmata 10-11. We now proceed to the proof of these lemmata.

E.1 Expansion of $\Delta\phi_x$

We expand $\Delta\phi_x$ into the sum of six terms.

Lemma 12. We have 

A0x(t) - 0x(D(t)|s)-0x(Do|s) 



-xT[Pj(0) - P,j(t)]x - AsJ[ej(0)[Do]J - ej(i)[D(i)]J]x 
+ysJ[0j(O)-©j(t)]sj 



where 



Doao 



Co. At) 

CeAt) 



Pj(0)-Pj(i) 
Pj(0)-P,i(i) 

Cs.At) - -A[sign(ao)]j[0j(O)[Do]J-0j(i)[D(t)]J 
Cs,.W - -A[sign(ao)]j[0j(O)[Do]J-0j(t)[D(i)]J 

Cs At) = y [sign(ao)]I [0,1(0) - 0,,(i)] sign(ao)j. 



DoQ!o 



(54) 
(55) 

(56) 
(57) 
(58) 
(59) 
(60) 

(61) 
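Proof. Denoting $s = \mathrm{sign}(\alpha_0)$ and $J \subseteq [1;p]$ the support of $\alpha_0$, we have by definition of $\phi_x$:

$\phi_x(D(t)|s) = \frac{1}{2}\left[\|x\|_2^2 - ([D(t)]_J^\top x - \lambda s_J)^\top([D(t)]_J^\top[D(t)]_J)^{-1}([D(t)]_J^\top x - \lambda s_J)\right]$

$= \frac{1}{2}\|x\|_2^2 - \frac{1}{2}x^\top P_J(t)x + \lambda s_J^\top\Theta_J(t)[D(t)]_J^\top x - \frac{\lambda^2}{2}s_J^\top\Theta_J(t)s_J$. (62)

This yields (54) and we conclude thanks to $x = D_0\alpha_0 + \varepsilon = [D_0]_J[\alpha_0]_J + \varepsilon$. □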
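The closed form (62) can be checked on a toy example. The sketch below assumes that $\phi_x(D|s)$ denotes the minimum over coefficients supported on $J$ of $\frac{1}{2}\|x - D_J\alpha_J\|_2^2 + \lambda s_J^\top\alpha_J$, which is how we read the definition used in Section B.2; it uses a support of size two so that $\Theta_J$ has an explicit $2 \times 2$ inverse. The function names and the numerical values are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi_closed_form(x, d1, d2, s, lam):
    """Right-hand side of (62) for J = {1, 2} with atoms d1, d2, plus the minimiser."""
    g11, g12, g22 = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    det = g11 * g22 - g12 * g12
    theta = [[g22 / det, -g12 / det], [-g12 / det, g11 / det]]  # (D_J^T D_J)^{-1}
    c = [dot(d1, x), dot(d2, x)]                                # D_J^T x
    theta_c = [dot(theta[0], c), dot(theta[1], c)]
    theta_s = [dot(theta[0], s), dot(theta[1], s)]
    value = (0.5 * dot(x, x)
             - 0.5 * dot(c, theta_c)          # x^T P_J x = c^T Theta_J c
             + lam * dot(s, theta_c)          # lambda * s^T Theta_J D_J^T x
             - 0.5 * lam ** 2 * dot(s, theta_s))
    alpha = [theta_c[0] - lam * theta_s[0], theta_c[1] - lam * theta_s[1]]
    return value, alpha

def objective(x, d1, d2, s, lam, alpha):
    """0.5 * ||x - D_J alpha||^2 + lam * s^T alpha, evaluated directly."""
    residual = [xi - alpha[0] * a - alpha[1] * b for xi, a, b in zip(x, d1, d2)]
    return 0.5 * dot(residual, residual) + lam * dot(s, alpha)

if __name__ == "__main__":
    x, d1, d2 = [0.5, -1.0, 2.0], [1.0, 0.0, 0.0], [0.6, 0.8, 0.0]
    s, lam = [1.0, -1.0], 0.3
    value, alpha = phi_closed_form(x, d1, d2, s, lam)
    print(value, objective(x, d1, d2, s, lam, alpha))
```

The minimiser used here is $\alpha_J = \Theta_J([D]_J^\top x - \lambda s_J)$, obtained by setting the gradient of the quadratic objective to zero.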

E.2 Proof of Lemma 8

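Fix $W$ and $v$ and denote $y^i(t) = \phi_{x^i}(D(W,v,t)|s_0^i)$. By definition of $\phi_x$ we have $y^i(t) \le \mathcal{L}_{x^i}(D(W,v,t), \alpha_0^i)$,
hence, using Lemma 23 we have for any $\tau \ge 1$
$$\Pr\big(y^i(t) > A_{\mathcal{L}}(t)\cdot\tau\big) \le e^{-\tau}$$
where
$$A_{\mathcal{L}}(t) \triangleq \frac{5(1+\log 2)}{2}\cdot\big(t^2\sigma_\alpha^2 + m\sigma^2 + \lambda k\sigma_\alpha\big).$$
Hence, exploiting Corollary 4 with $\kappa = 1$ and $0 \le \tau \le \sqrt{n}$, we obtain
$$\Pr\Big(\Big|\frac{1}{n}\sum_{i=1}^n\big(y^i(t) - \mathbb{E}\{y^i(t)\}\big)\Big| \ge 24A_{\mathcal{L}}(t)\cdot\frac{\tau}{\sqrt{n}}\Big) \le \exp(-\tau^2).$$
Observing that $\Delta\Phi_n(W,v,t) = \frac{1}{n}\sum_{i=1}^n\big(y^i(t) - y^i(0)\big)$ we obtain, by a union bound,
$$\Pr\Big(\big|\Delta\Phi_n(W,v,t) - \mathbb{E}\{\Delta\Phi_n(W,v,t)\}\big| \ge 24\big(A_{\mathcal{L}}(t) + A_{\mathcal{L}}(0)\big)\cdot\frac{\tau}{\sqrt{n}}\Big) \le 2\exp(-\tau^2).$$
We conclude by making the constant explicit:
$$24\big(A_{\mathcal{L}}(t) + A_{\mathcal{L}}(0)\big) = 60(1+\log 2)\,\big(t^2\sigma_\alpha^2 + 2m\sigma^2 + 2\lambda k\sigma_\alpha\big) \le c(t).$$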
E.3 Proof of Lemma O 

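Given the expansion (54), using the shorthands $P_J = P_J(W,v,t)$ and $P'_J = P_J(W',v',t)$, as well as for
other similar quantities, and averaging over $n$, we obtain
$$|\Delta\Phi_n - \Delta\Phi'_n| \le \frac{1}{2n}\sum_{i=1}^n\|x^i\|_2^2\cdot\max_J|\!|\!|P_J - P'_J|\!|\!|_2 + \frac{\lambda\sqrt{k}}{n}\sum_{i=1}^n\|x^i\|_2\cdot\max_J|\!|\!|\Theta_J D_J^\top - \Theta'_J D'^{\top}_J|\!|\!|_2 + \frac{\lambda^2 k}{2}\cdot\max_J|\!|\!|\Theta_J - \Theta'_J|\!|\!|_2. \tag{63}$$
Using Lemma 19 this yields the Lipschitz bound $|\Delta\Phi_n - \Delta\Phi'_n| \le L_n\cdot d((W,v),(W',v'))$ with
$$L_n \le \frac{5}{2}\cdot t\cdot C_t^3\Big\{\frac{1}{n}\sum_{i=1}^n\|x^i\|_2^2 + 2\lambda\sqrt{k}\cdot\frac{1}{n}\sum_{i=1}^n\|x^i\|_2 + \lambda^2 k\Big\}.$$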

Using Lemma 22 we check that $y^i = \|x^i\|_2^2$ satisfies the hypothesis (see Eq. (100)) of Lemma 24 with
$A = 5(k\sigma_\alpha^2 + m\sigma^2)$. Hence, exploiting Corollary 4 with $\kappa = 1$ and $0 \le \tau \le \sqrt{n}$, we obtain, except with
probability at most $2\exp(-\tau^2)$,
$$\frac{1}{n}\sum_{i=1}^n\|x^i\|_2^2 \le 6\cdot\big[5\cdot(k\sigma_\alpha^2 + m\sigma^2)\big]\cdot\Big(1 + \frac{4\tau}{\sqrt{n}}\Big)$$
$$\frac{1}{n}\sum_{i=1}^n\|x^i\|_2 \le 6\cdot\sqrt{5\cdot(k\sigma_\alpha^2 + m\sigma^2)}\cdot\Big(1 + \frac{4\tau}{\sqrt{n}}\Big).$$
Inserting the above estimates into the bound on $L_n$ yields, except with probability at most $2\exp(-\tau^2)$,
$$L_n \le L\cdot\Big(1 + \frac{4\tau}{\sqrt{n}}\Big).$$



E.4 Proof of Lemma 10

The proof of Lemma 10 exploits the covering number $\mathcal{N}$ of $\mathcal{W}_{D_0} \times \mathcal{S}^p$ with respect to the metric (51). For
background about covering numbers, we refer the reader to Cucker and Smale [2002] and references therein.

Lemma 13 (ε-nets for $\mathcal{W}_{D_0} \times \mathcal{S}^p$). For the Euclidean metric, and for any $\epsilon > 0$, we have
$$\mathcal{N}(\mathcal{S}^p, \epsilon) \le \Big(1 + \frac{2}{\epsilon}\Big)^{p}.$$
Moreover, equip $\mathbb{R}^{m\times p}$ with the norm $M \mapsto \max_{j\in[1;p]}\|m^j\|_2$. For the metric induced by this norm, and for any
$\epsilon > 0$, we have
$$\mathcal{N}(\mathcal{W}_{D_0}, \epsilon) \le \Big(1 + \frac{2}{\epsilon}\Big)^{(m-1)p}.$$

Proof. We resort to Lemma 2 in Vershynin [2010], which gives the first conclusion for the sphere in $\mathbb{R}^p$. As
for the second result, remember that the set $\mathcal{W}_{D_0}$ is defined as a product of spheres in spaces of dimension
$m - 1$. Indeed, we have for any $W \in \mathcal{W}_{D_0}$ and for any $j \in [1;p]$, $\|w^j\|_2 = 1$ along with the constraint
$[d_0^j]^\top w^j = 0$, which implies that $w^j$ belongs to the orthogonal space of $\mathrm{span}(d_0^j)$, of dimension $m - 1$.
Considering a product of $p$ nets such as that used for $\mathcal{S}^p$, the second conclusion follows from the definition
of the metric based on the column-wise norm above. □

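From Lemma 13 we know that for any $0 < \epsilon \le 1$ there exists an ε-net of $\mathcal{W}_{D_0} \times \mathcal{S}^p$ with respect to the
metric (51) with at most $(3/\epsilon)^{mp}$ elements. Combining this with Lemmata 8 and 9 we have for any $0 \le \tau \le \sqrt{n}$:
except with probability at most $(3/\epsilon)^{mp}\cdot 2\exp(-\tau^2) + 2\exp(-\tau^2) \le 4\cdot(3/\epsilon)^{mp}\cdot\exp(-\tau^2)$,
$$\inf_{W,v}\Delta\Phi_n(W,v,t) \ge \inf_{W,v}\mathbb{E}\{\Delta\Phi_n(W,v,t)\} - \Big(c(t)\cdot\frac{\tau}{\sqrt{n}} + L\cdot\Big(1 + \frac{4\tau}{\sqrt{n}}\Big)\cdot\epsilon\Big).$$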



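Now we set $\tau = \sqrt{mp\log n}$, and $\epsilon = \frac{\tau}{\sqrt{n}} = \sqrt{\frac{mp\log n}{n}}$. Under the assumption that
$$\frac{n}{\log n} \ge mp$$
we check that $\tau \le \sqrt{n}$ and $\epsilon \le 1$, hence the probability bound holds. We estimate the probability bound with:
$$mp\log\frac{3}{\epsilon} = \frac{mp}{2}\log\frac{9}{\epsilon^2} = \frac{mp}{2}\log\Big(\frac{n}{\log n}\cdot\frac{9}{mp}\Big) \le \frac{mp}{2}\log\frac{9}{mp} + \frac{mp}{2}\log n$$
$$mp\log\frac{3}{\epsilon} - mp\log n \le \frac{mp}{2}\log\frac{9}{mp} - \frac{mp}{2}\log n$$
$$(3/\epsilon)^{mp}\exp(-\tau^2) \le \Big(\frac{mpn}{9}\Big)^{-mp/2}.$$
Finally, recalling that
$$L \triangleq 30\,C_t^3\cdot t\cdot\big[5(k\sigma_\alpha^2 + m\sigma^2) + \lambda^2 k\big]$$
$$c(t) = 102\cdot\big(t^2\sigma_\alpha^2 + 2m\sigma^2 + 2\lambda k\sigma_\alpha\big),$$
and since the assumption $C_t \le 1.5$ implies $150\,C_t^3 \le 507$, we obtain
$$c(t) + L \le 609\,\big(k\sigma_\alpha^2\cdot t + 2m\sigma^2 + \lambda k\sigma_\alpha + \lambda^2 k\cdot t\big) = B/5$$
$$c(t)\cdot\frac{\tau}{\sqrt{n}} + L\cdot\Big(1 + \frac{4\tau}{\sqrt{n}}\Big)\cdot\epsilon \le \big(c(t) + L\big)\cdot\epsilon + 4\big(c(t) + L\big)\epsilon^2 \le \big(c(t) + L\big)\,5\epsilon \le B\epsilon.$$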
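
The choice $\tau = \sqrt{mp\log n}$, $\epsilon = \tau/\sqrt{n}$ and the resulting probability estimate can be checked in the log domain. The snippet below is a small sketch using only the Python standard library; the function name `log_failure_terms` and the sample values of $m$, $p$, $n$ are hypothetical.

```python
import math

def log_failure_terms(m, p, n):
    """Return (eps, log of (3/eps)^{mp} exp(-tau^2), log of (mpn/9)^{-mp/2})."""
    mp = m * p
    if n / math.log(n) < mp:
        raise ValueError("the argument requires n / log(n) >= m * p")
    tau_sq = mp * math.log(n)            # tau^2 = mp log n
    eps = math.sqrt(tau_sq / n)          # eps = tau / sqrt(n)
    log_lhs = mp * math.log(3.0 / eps) - tau_sq
    log_rhs = -(mp / 2.0) * math.log(mp * n / 9.0)
    return eps, log_lhs, log_rhs

# Hypothetical sizes: m = 5, p = 10 atoms, n = 10000 signals.
eps, log_lhs, log_rhs = log_failure_terms(m=5, p=10, n=10_000)
print(eps, log_lhs, log_rhs)
```

Working with logarithms avoids underflow, since both quantities are astronomically small for realistic sizes. The comparison relies on $\log n \ge 1$, which the chosen $n$ satisfies.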

E.5 Proof of Lemma 11

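First, we observe that by the statistical independence between $\alpha_0$ and $\varepsilon$ we have
$$\mathbb{E}\{C_{\alpha,\varepsilon}(t)\} = \mathbb{E}\{C_{s,\varepsilon}(t)\} = 0.$$
Moreover, we can rewrite
$$C_{\alpha,\alpha}(t) = \tfrac{1}{2}\cdot\mathrm{Tr}\big([\alpha_0]_J[\alpha_0]_J^\top\cdot[D_0]_J^\top(P_J(0) - P_J(t))[D_0]_J\big)$$
$$C_{\varepsilon,\varepsilon}(t) = \tfrac{1}{2}\cdot\mathrm{Tr}\big(\varepsilon\varepsilon^\top\cdot(P_J(0) - P_J(t))\big)$$
$$C_{s,\alpha}(t) = -\lambda\cdot\mathrm{Tr}\big([\alpha_0]_J\,\mathrm{sign}(\alpha_0)_J^\top\cdot\big[\Theta_J(0)[D_0]_J^\top - \Theta_J(t)[D(t)]_J^\top\big][D_0]_J\big)$$
$$C_{s,s}(t) = \tfrac{\lambda^2}{2}\cdot\mathrm{Tr}\big(\mathrm{sign}(\alpha_0)_J\,\mathrm{sign}(\alpha_0)_J^\top\cdot(\Theta_J(0) - \Theta_J(t))\big).$$
Since the coefficients $\alpha_0, \varepsilon$ are independent from the support $J$ we obtain
$$\mathbb{E}\{C_{\alpha,\alpha}(t)\} = \frac{\mathbb{E}\{\alpha_0^2\}}{2}\cdot\mathbb{E}_J\big\{\mathrm{Tr}\big([D_0]_J^\top(I - P_J(t))[D_0]_J\big)\big\} \tag{64}$$
$$\mathbb{E}\{C_{\varepsilon,\varepsilon}(t)\} = \frac{\mathbb{E}\{\varepsilon^2\}}{2}\cdot\mathbb{E}_J\big\{\mathrm{Tr}\big(P_J(0) - P_J(t)\big)\big\} = 0 \tag{65}$$
$$\mathbb{E}\{C_{s,\alpha}(t)\} = -\lambda\cdot\mathbb{E}\{|\alpha_0|\}\cdot\mathbb{E}_J\big\{\mathrm{Tr}\big(\big[\Theta_J(0)[D_0]_J^\top - \Theta_J(t)[D(t)]_J^\top\big][D_0]_J\big)\big\} \tag{66}$$
$$\mathbb{E}\{C_{s,s}(t)\} = \frac{\lambda^2}{2}\cdot\mathbb{E}_J\big\{\mathrm{Tr}\big(\Theta_J(0) - \Theta_J(t)\big)\big\} \tag{67}$$
where we used the fact that: (a) $P_J(0)[D_0]_J = [D_0]_J$; (b) since $P_J(t)$ is an orthogonal projector onto a subspace
of dimension $k$, $\mathrm{Tr}(P_J(0) - P_J(t)) = k - k = 0$.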

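The lemma below provides estimates of the remaining non-vanishing expectations which come up in the
quadratic forms (56) and (61) and the bilinear form (59). They directly provide Lemma 11 as a corollary.

Lemma 14. If $k\mu(t) \le 1/2$ then for any $W \in \mathcal{W}_{D_0}$, $v \in \mathcal{S}^p$ we have
$$\mathbb{E}_J\big\{\mathrm{Tr}\big([D_0]_J^\top(I - P_J(t))[D_0]_J\big)\big\} \ge (1 - K^2)\cdot\frac{k}{p}\cdot t^2 \tag{68}$$
$$\big|\mathbb{E}_J\big\{\mathrm{Tr}\big(\big[\Theta_J(0)[D_0]_J^\top - \Theta_J(t)[D(t)]_J^\top\big][D_0]_J\big)\big\}\big| \le 3Q_t^2\cdot t\cdot\frac{k}{p}\cdot|\!|\!|D_0|\!|\!|_2\cdot k\mu(t) \tag{69}$$
$$\big|\mathbb{E}_J\big\{\mathrm{Tr}\big(\Theta_J(0) - \Theta_J(t)\big)\big\}\big| \le 8Q_t^4\cdot t\cdot\frac{k}{p}\cdot|\!|\!|D_0|\!|\!|_2\cdot k\mu(t), \tag{70}$$
with $K \triangleq C_t\big(|\!|\!|D_0|\!|\!|_2\cdot\sqrt{k/p} + t\big)$.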

Proof of Lemma 14 - Equation (68). Since $k\mu(t) \le 1/2$, we have $t \le 1/(6k) \le 1/6$ and $\delta_k(D_0) \le k\mu_0 \le 1/2$,
so that $t \le \sqrt{1 - \delta_k(D_0)}$. In particular, we have $t < \pi/2$ and the matrix $C = \mathrm{Diag}(\cos(v_j t))$ is invertible.
From the equality $D(t) = D_0C + WS$ with $S = \mathrm{Diag}(\sin(v_j t))$ we deduce $D_0 = D(t)C^{-1} - WT$ with
$T = \mathrm{Diag}(\tan(v_j t))$. Since the columns of $[D(t)C^{-1}]_J$ belong to the span of $[D(t)]_J$ we obtain
$$\mathrm{Tr}\big([D_0]_J^\top(I - P_J(t))[D_0]_J\big) = \|(I - P_J(t))[D_0]_J\|_F^2 = \|(I - P_J(t))[WT]_J\|_F^2 = \|[WT]_J\|_F^2 - \|P_J(t)[WT]_J\|_F^2.$$
For the first term, since $\|w^j\|_2 = 1$, we have
$$\|WT\|_F^2 = \sum_{j=1}^p\tan^2(v_j t),$$
hence, since $\|v\|_2 = 1$, and $\tan^2(u) \ge u^2$ for $|u| \le 1$, we have
$$\mathbb{E}_J\|[WT]_J\|_F^2 = \frac{k}{p}\|WT\|_F^2 = \frac{k}{p}\cdot\sum_{j=1}^p\tan^2(v_j t) \ge \frac{k}{p}\cdot\sum_{j=1}^p t^2v_j^2 = \frac{k}{p}\cdot t^2.$$
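
The averaging identity $\mathbb{E}_J\|[M]_J\|_F^2 = \frac{k}{p}\|M\|_F^2$ used for the first term can be verified by brute force. The sketch below assumes NumPy and, as an explicit assumption, that the support $J$ is drawn uniformly among all subsets of size $k$; the helper name `mean_restricted_energy` and the toy sizes are illustrative.

```python
import itertools
import numpy as np

def mean_restricted_energy(M, k):
    """Average of ||M_J||_F^2 over all supports J of size k (uniform draw assumed)."""
    p = M.shape[1]
    supports = list(itertools.combinations(range(p), k))
    total = sum(np.linalg.norm(M[:, list(J)], 'fro') ** 2 for J in supports)
    return total / len(supports)

# Hypothetical instance: unit-norm columns W, unit-norm v, step t.
rng = np.random.default_rng(1)
m, p, k, t = 6, 5, 2, 0.3
W = rng.standard_normal((m, p))
W /= np.linalg.norm(W, axis=0)
v = rng.standard_normal(p)
v /= np.linalg.norm(v)
WT = W * np.tan(v * t)                      # W Diag(tan(v_j t))

lhs = mean_restricted_energy(WT, k)         # E_J ||[WT]_J||_F^2
rhs = (k / p) * np.linalg.norm(WT, 'fro') ** 2
lower = (k / p) * t ** 2                    # lower bound from tan^2(u) >= u^2
print(lhs, rhs, lower)
```

Each column belongs to a fraction $k/p$ of the supports, which is why the average equals $\frac{k}{p}\|WT\|_F^2$; the lower bound then follows from $\tan^2(u) \ge u^2$.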

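For the second term, since $P_J(t) = D_J(t)\Theta_J(t)D_J(t)^\top$, using Lemma 5 we have the bound
$$\|P_J(t)[WT]_J\|_F^2 \le C_t^2\,\|D_J(t)^\top[WT]_J\|_F^2.$$
Moreover, by Lemma |H, using the Cauchy-Schwarz inequality for random variables,
$$\|D_J(t)^\top[WT]_J\|_F \le \|[D_0]_J^\top[WT]_J\|_F + \|[D(t) - D_0]_J^\top[WT]_J\|_F \le \|[D_0]_J^\top[WT]_J\|_F + t\cdot\|[WT]_J\|_F$$
$$\mathbb{E}\{\|D_J(t)^\top[WT]_J\|_F^2\} \le \mathbb{E}\{\|[D_0]_J^\top[WT]_J\|_F^2\} + 2t\cdot\mathbb{E}\{\|[D_0]_J^\top[WT]_J\|_F\cdot\|[WT]_J\|_F\} + t^2\cdot\mathbb{E}\{\|[WT]_J\|_F^2\}$$
$$\le \mathbb{E}\{\|[D_0]_J^\top[WT]_J\|_F^2\} + 2t\sqrt{\mathbb{E}\{\|[D_0]_J^\top[WT]_J\|_F^2\}}\sqrt{\mathbb{E}\{\|[WT]_J\|_F^2\}} + t^2\cdot\mathbb{E}\{\|[WT]_J\|_F^2\}$$
$$\le \Big(\sqrt{\mathbb{E}\{\|[D_0]_J^\top[WT]_J\|_F^2\}} + t\cdot\sqrt{\tfrac{k}{p}}\cdot\|WT\|_F\Big)^2.$$
Now, proceeding as in Lemma 7, we compute
$$\mathbb{E}\,\|[D_0]_J^\top[WT]_J\|_F^2 = \frac{k(k-1)}{p(p-1)}\|D_0^\top WT\|_F^2 \le \frac{k(k-1)}{p(p-1)}\,|\!|\!|D_0|\!|\!|_2^2\,\|WT\|_F^2,$$
hence
$$\mathbb{E}\,\|D_J(t)^\top[WT]_J\|_F^2 \le \frac{k}{p}\cdot\|WT\|_F^2\cdot\Big(\sqrt{\tfrac{k-1}{p-1}}\cdot|\!|\!|D_0|\!|\!|_2 + t\Big)^2 \le \frac{k}{p}\cdot\|WT\|_F^2\cdot\Big(\sqrt{\tfrac{k}{p}}\cdot|\!|\!|D_0|\!|\!|_2 + t\Big)^2.$$
Putting the pieces together, we obtain the lower bound
$$\mathbb{E}_J\,\mathrm{Tr}\big([D_0]_J^\top(I - P_J(t))[D_0]_J\big) \ge \frac{k}{p}\cdot\|WT\|_F^2\cdot\Big(1 - C_t^2\cdot\Big(|\!|\!|D_0|\!|\!|_2\cdot\sqrt{\tfrac{k}{p}} + t\Big)^2\Big) = \frac{k}{p}\cdot\|WT\|_F^2\,(1 - K^2).$$
□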



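Proof of Lemma 14 - Equation (69). We first develop Equation (69) and use that $\Theta_J(0)[D_0]_J^\top[D_0]_J = I$ in
order to obtain
$$\mathrm{Tr}\big(\big[\Theta_J(0)[D_0]_J^\top - \Theta_J(t)[D(t)]_J^\top\big][D_0]_J\big) = k - \mathrm{Tr}\big(\Theta_J(t)[D(t)]_J^\top[D_0]_J\big).$$
Applying Lemma 1 we know there exists $W_t \in \mathcal{W}_{D(t)}$ such that
$$D_0 = D(t)\,\mathrm{Diag}(\cos(v_j t)) + W_t\,\mathrm{Diag}(\sin(v_j t)),$$
and the trace above further simplifies as
$$\mathrm{Tr}\big(\big[\Theta_J(0)[D_0]_J^\top - \Theta_J(t)[D(t)]_J^\top\big][D_0]_J\big) = k - \sum_{j\in J}\cos(v_j t) - \mathrm{Tr}\big(\Theta_J(t)[D(t)]_J^\top[W_tS(t)]_J\big)$$
$$= \sum_{j\in J}\big(1 - \cos(v_j t)\big) - \mathrm{Tr}\big(\Theta_J(t)[D(t)]_J^\top[W_tS(t)]_J\big),$$
where for short, we refer to $\mathrm{Diag}(\sin(v_j t))$ as $S(t)$.
The first term is simple to handle since we have
$$\mathbb{E}_J\Big[\sum_{j\in J}\big(1 - \cos(v_j t)\big)\Big] \le \frac{t^2}{2}\,\mathbb{E}_J\big[\|v_J\|_2^2\big] = \frac{kt^2}{2p}.$$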



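We now turn to the second term whose control is more involved. Following Geng et al. [2011], we introduce
the self-adjoint operator $\Gamma_{D(t)}$ defined for any $M \in \mathbb{R}^{m\times p}$ by
$$\Gamma_{D(t)}(M) = [\Gamma_1(t)m^1, \ldots, \Gamma_p(t)m^p], \quad\text{with } \Gamma_j(t) \triangleq I - d^j(t)[d^j(t)]^\top.$$
In words, $\Gamma_{D(t)}(M)$ projects each column of $M$ onto the orthogonal complement of the corresponding column
of the dictionary $D(t)$. In particular, note that for any $M \in \mathcal{W}_{D(t)}$, we therefore have $\Gamma_{D(t)}(M) = M$.
Considering the symmetric matrix $U(t) \triangleq \mathbb{E}_J[\Pi_J\Theta_J(t)\Pi_J^\top]$, we next obtain
$$\mathbb{E}_J\big[\mathrm{Tr}\big(\Theta_J(t)[D(t)]_J^\top[W_tS(t)]_J\big)\big] = \mathbb{E}_J\big[\mathrm{Tr}\big(\Pi_J\Theta_J(t)\Pi_J^\top[D(t)]^\top W_tS(t)\big)\big]$$
$$= \mathrm{Tr}\big(U(t)[D(t)]^\top W_tS(t)\big)$$
$$= \mathrm{Tr}\big((D(t)U(t))^\top\,\Gamma_{D(t)}(W_tS(t))\big)$$
$$= \mathrm{Tr}\big(\Gamma_{D(t)}(D(t)U(t))^\top\,W_tS(t)\big)$$
$$\le t\,\|\Gamma_{D(t)}(D(t)U(t))\|_F,$$
where we have successively used the fact that $\Gamma_{D(t)}$ is self-adjoint and that for any $W \in \mathcal{W}_{D(t)}$, the norm
$\|WS(t)\|_F$ is upper bounded by $t$.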

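Observe that the $j$-th column of the matrix $\Gamma_j(t)D(t)$ is equal to zero. As a consequence, we have
$$\|\Gamma_{D(t)}(D(t)U(t))\|_F = \|\Gamma_{D(t)}(D(t)U_{\mathrm{off}}(t))\|_F,$$
where $U_{\mathrm{off}}(t)$ denotes the matrix $U(t)$ with its diagonal terms set to zero. This leads to
$$\|\Gamma_{D(t)}(D(t)U(t))\|_F^2 = \|\Gamma_{D(t)}(D(t)U_{\mathrm{off}}(t))\|_F^2 = \sum_{j=1}^p\|\Gamma_j(t)D(t)u_{\mathrm{off}}^j\|_2^2$$
$$\le |\!|\!|D(t)|\!|\!|_2^2\,\|U_{\mathrm{off}}(t)\|_F^2 = |\!|\!|D(t)|\!|\!|_2^2\,\big\|\mathbb{E}_J[\Pi_J\Theta_J(t)\Pi_J^\top]_{\mathrm{off}}\big\|_F^2,$$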

where we have exploited the fact that projectors have their spectral norms bounded by one. Using Corollary 1
we have for $i \neq j$ with $i, j \in J$



\[iijejmj]j<smj)j 



kfi(t) 



and 



ij[(nj0j(i)nj),,,]|< 



p{p - 1) 1 - kf,{t) -
hence



To recapitulate and putting all the pieces together, we obtain the following upper bound 

|Ej{Tr([0j(O)[Do]J-0jW[D(t)jn[Do]j)}| < -- + t|||D(t)|||,- ^ /^^^^^ 



k 

< t- 
P 



^-|||D(t)|||,^^ 



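To conclude, we use Lemma 0] to get $|\!|\!|D(t)|\!|\!|_2 \le 2|\!|\!|D_0|\!|\!|_2$, and the fact that $|\!|\!|D_0|\!|\!|_2 \ge 1$. □
Proof of Lemma 14 - Equation (70). We start by writing Equation (70) in the following integral form
$$\mathrm{Tr}\big(\Theta_J(t) - \Theta_J(0)\big) = \int_0^t\mathrm{Tr}\big(\nabla_t\Theta_J(\tau)\big)\,d\tau,$$
where the derivative is computed in Lemma 17, namely,
$$\mathrm{Tr}\big(\nabla_t\Theta_J(t)\big) = -2\,\mathrm{Tr}\big(\Theta_J(t)[\nabla_tD(t)]_J^\top D_J(t)\Theta_J(t)\big).$$
Introducing the symmetric matrix $U(t) \triangleq \mathbb{E}_J\big[\Pi_J[\Theta_J(t)]^2\Pi_J^\top\big]$, we next obtain by linearity of the trace and
the integral
$$\mathbb{E}_J\big[\mathrm{Tr}\big(\Theta_J(t) - \Theta_J(0)\big)\big] = -2\int_0^t\mathrm{Tr}\big(D(\tau)U(\tau)[\nabla_tD(\tau)]^\top\big)\,d\tau \le 2t\max_{\tau\in[0,t]}\big|\mathrm{Tr}\big(D(\tau)U(\tau)[\nabla_tD(\tau)]^\top\big)\big|.$$
Noticing that we are (almost) in the same setting as that of the previous proof, we are going to make
use again of the operator $\Gamma_{D(\tau)}$ in order to control the off-diagonal terms of $U(\tau)$. More precisely, since
$\mathrm{diag}([\nabla_tD(\tau)]^\top D(\tau)) = 0$ and $\|\nabla_tD(\tau)\|_F = 1$, the same reasoning as that followed in the previous proof
leads to
$$\mathbb{E}_J\big[\mathrm{Tr}\big(\Theta_J(t) - \Theta_J(0)\big)\big] \le 2t\cdot\max_{\tau\in[0,t]}|\!|\!|D(\tau)|\!|\!|_2\cdot\big\|\mathbb{E}_J\big[\Pi_J[\Theta_J(\tau)]^2\Pi_J^\top\big]_{\mathrm{off}}\big\|_F.$$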

Invoking Corollary 1 we have for $i \neq j$ with $i, j \in J$

|M4(n.|e.Ml'nJ,,,,]|<i|^^^J^|^. 

hence

||E,[n,[0,(rrnJ]j|g<p(p-i)( ^|^_^j^^/^^|^^^2 ) <j4 



which gives the advertised conclusion. □ 

F Proof of Proposition 3

We begin with a few lemmata related to the considered optimization problem.

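Lemma 15. Let $J \subseteq [1;p]$ and $s \in \{-1, 0, 1\}^{|J|}$. Consider a dictionary $D \in \mathbb{R}^{m\times p}$ such that $D_J^\top D_J$ is
invertible. Consider also the vector $\hat\alpha \in \mathbb{R}^p$ defined by
$$\hat\alpha = \begin{pmatrix}[D_J^\top D_J]^{-1}[D_J^\top x - \lambda s]\\ 0_{J^c}\end{pmatrix}$$
with $x \in \mathbb{R}^m$ and $\lambda$ a nonnegative scalar. If $x = [D_0]_J[\alpha_0]_J + \varepsilon$ for some $(D_0, \alpha_0, \varepsilon) \in \mathbb{R}^{m\times p}\times\mathbb{R}^p\times\mathbb{R}^m$,
then we have
$$\|[\hat\alpha - \alpha_0]_J\|_\infty \le |\!|\!|[D_J^\top D_J]^{-1}|\!|\!|_\infty\big[\lambda + \|D_J^\top(x - D\alpha_0)\|_\infty\big].$$
Proof. The proof consists of simple algebraic manipulations. We plug the expression of $x$ into that of $\hat\alpha$,
then use the triangle inequality for $\|\cdot\|_\infty$, along with the definition and the sub-multiplicativity of $|\!|\!|\cdot|\!|\!|_\infty$. □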

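Lemma 16. Let $x \in \mathbb{R}^m$ be a signal. Consider $J \subseteq [1;p]$ and a dictionary $D \in \mathbb{R}^{m\times p}$ such that $D_J^\top D_J$ is
invertible. Consider also a sign vector $s \in \{-1, 1\}^{|J|}$ and define $\hat\alpha \in \mathbb{R}^p$ by
$$\hat\alpha = \begin{pmatrix}[D_J^\top D_J]^{-1}[D_J^\top x - \lambda s]\\ 0_{J^c}\end{pmatrix},$$
for some regularization parameter $\lambda > 0$. If the following two conditions hold
$$\begin{cases}\mathrm{sign}\big([D_J^\top D_J]^{-1}[D_J^\top x - \lambda s]\big) = s,\\ \|D_{J^c}^\top(I - P_J)x\|_\infty + \lambda\,|\!|\!|D_{J^c}^\top D_J[D_J^\top D_J]^{-1}|\!|\!|_\infty < \lambda,\end{cases}$$
then $\hat\alpha$ is the unique solution of $\min_{\alpha\in\mathbb{R}^p}\big[\tfrac{1}{2}\|x - D\alpha\|_2^2 + \lambda\|\alpha\|_1\big]$ and we have $\mathrm{sign}(\hat\alpha_J) = s$.
Proof. We first check that $\hat\alpha$ is a solution of the Lasso program. It is well-known [e.g., see Fuchs, 2005,
Wainwright, 2009] that this statement is equivalent to the existence of a subgradient $z \in \partial\|\hat\alpha\|_1$ such that
$-D^\top(x - D\hat\alpha) + \lambda z = 0$, where $z_j = \mathrm{sign}(\hat\alpha_j)$ if $\hat\alpha_j \neq 0$, and $|z_j| \le 1$ otherwise.

We now build from $s$ such a subgradient. Given the definition of $\hat\alpha$ and the assumption made on its sign,
we can take $z_J = s$. It now remains to find a subgradient on $J^c$ that agrees with the fact that $\hat\alpha_{J^c} = 0$. More
precisely, we define $z_{J^c}$ by
$$\lambda z_{J^c} \triangleq D_{J^c}^\top(x - D\hat\alpha) = D_{J^c}^\top(I - P_J)x + \lambda D_{J^c}^\top D_J[D_J^\top D_J]^{-1}s.$$
Using our assumption, we have $\|z_{J^c}\|_\infty < 1$. We have therefore proved that $\hat\alpha$ is a solution of the Lasso
program. The uniqueness comes from Lemma 1 in Wainwright [2009]. □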
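
The two conditions of Lemma 16 and the associated optimality (subgradient) equations can be checked numerically on a small instance. The following is a minimal sketch assuming NumPy, not the authors' code; the function names `lasso_candidate` and `lemma16_conditions` and the toy identity dictionary are illustrative assumptions.

```python
import numpy as np

def lasso_candidate(D, x, J, s_J, lam):
    """Closed-form candidate: (D_J^T D_J)^{-1}(D_J^T x - lam s_J) on J, zero elsewhere."""
    a = np.zeros(D.shape[1])
    D_J = D[:, J]
    a[J] = np.linalg.solve(D_J.T @ D_J, D_J.T @ x - lam * s_J)
    return a

def lemma16_conditions(D, x, J, s_J, lam):
    """Return (sign condition holds, strict dual condition holds)."""
    Jc = np.setdiff1d(np.arange(D.shape[1]), J)
    D_J, D_Jc = D[:, J], D[:, Jc]
    G_inv = np.linalg.inv(D_J.T @ D_J)
    a_J = G_inv @ (D_J.T @ x - lam * s_J)
    P_J = D_J @ G_inv @ D_J.T
    sign_ok = np.array_equal(np.sign(a_J), s_J)
    # ord=np.inf on a matrix gives the maximum absolute row sum, i.e. |||.|||_inf
    lhs = (np.max(np.abs(D_Jc.T @ (x - P_J @ x)))
           + lam * np.linalg.norm(D_Jc.T @ D_J @ G_inv, ord=np.inf))
    return sign_ok, bool(lhs < lam)

# Hypothetical toy instance with an orthonormal dictionary.
D = np.eye(3)
x = np.array([2.0, 0.1, -3.0])
J = np.array([0, 2])
s_J = np.array([1.0, -1.0])
lam = 0.5

a = lasso_candidate(D, x, J, s_J, lam)
sign_ok, dual_ok = lemma16_conditions(D, x, J, s_J, lam)
g = D.T @ (x - D @ a)   # should equal lam * s on J and be < lam in magnitude off J
print(a, sign_ok, dual_ok, g)
```

With an orthonormal dictionary the candidate reduces to soft-thresholding of $x$, which makes the example easy to verify by hand.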

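Corollary 3. Assume that $k\mu(t) \le 1/2$, $0 \le t' \le t$, $\tfrac{9}{4}\lambda < \underline{\alpha} \le \min_{j\in J}|[\alpha_0]_j|$, and that
$$\|[D(t')]_J^\top(x - D(t')\alpha_0)\|_\infty \le \lambda(2 - Q_t^2) \tag{71}$$
$$\|[D(t')]_{J^c}^\top(I - P_J(t'))x\|_\infty < \lambda(2 - Q_t^2). \tag{72}$$
Then $\hat\alpha(t')$ is the unique solution of $\min_{\alpha\in\mathbb{R}^p}\big[\tfrac{1}{2}\|x - D(t')\alpha\|_2^2 + \lambda\|\alpha\|_1\big]$.

Proof. Since $k\mu(t) \le 1/2$, we have $Q_t^2 \le 2$, and by Corollary [5] we have, uniformly for all $(W,v)$ and
$0 \le t' \le t$,
$$|\!|\!|\big[[D(t')]_J^\top[D(t')]_J\big]^{-1}|\!|\!|_\infty \le Q_t^2$$
$$|\!|\!|[D(t')]_{J^c}^\top[D(t')]_J\big([D(t')]_J^\top[D(t')]_J\big)^{-1}|\!|\!|_\infty \le Q_t^2 - 1 \le 1.$$
Exploiting Lemma 15 and the bound (71) we have
$$\|[\hat\alpha(t') - \alpha_0]_J\|_\infty \le |\!|\!|\big[[D(t')]_J^\top[D(t')]_J\big]^{-1}|\!|\!|_\infty\big[\lambda + \|[D(t')]_J^\top(x - D(t')\alpha_0)\|_\infty\big]$$
$$\le Q_t^2\cdot\lambda\cdot\big[1 + (2 - Q_t^2)\big] = \lambda\cdot Q_t^2\cdot(3 - Q_t^2) \le \tfrac{9}{4}\lambda < \underline{\alpha} \le \min_{j\in J}|[\alpha_0]_j|,$$
where we used that $u(3 - u) \le 9/4$ for all $u \in \mathbb{R}$. We conclude that $\mathrm{sign}(\hat\alpha(t')) = \mathrm{sign}(\alpha_0)$.

It remains to prove that $\hat\alpha(t')$ is the unique solution of the Lasso program. To this end, we take advantage
of Lemma 16. We recall the quantity which needs to be smaller than $\lambda$:
$$\|[D(t')]_{J^c}^\top(I - P_J(t'))x\|_\infty + \lambda\,|\!|\!|[D(t')]_{J^c}^\top[D(t')]_J\big([D(t')]_J^\top[D(t')]_J\big)^{-1}|\!|\!|_\infty.$$
The quantity above is first upper bounded by
$$\|[D(t')]_{J^c}^\top(I - P_J(t'))x\|_\infty + \lambda(Q_t^2 - 1),$$
and then, exploiting the bound (72), strictly upper bounded by $\lambda(2 - Q_t^2) + \lambda(Q_t^2 - 1) = \lambda$. Putting together
the pieces with $\mathrm{sign}(\hat\alpha(t')) = \mathrm{sign}(\alpha_0)$, Lemma 16 leads to the desired conclusion.

□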

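We can now proceed to the proof of Proposition 3. Since $\|d^j(t')\|_2 = 1$ for all $j$, we have
$$\|[D(t')]_J^\top(x - D(t')\alpha_0)\|_\infty \le \|x - D(t')\alpha_0\|_2 \tag{73}$$
$$\|[D(t')]_{J^c}^\top(I - P_J(t'))x\|_\infty \le \|(I - P_J(t'))x\|_2. \tag{74}$$
Using Lemma 22, provided that
$$\tau \triangleq \frac{\lambda^2(2 - Q_t^2)^2}{5\cdot(t'^2\cdot\sigma_\alpha^2 + m\cdot\sigma^2)} \ge 1,$$
we have
$$\Pr\big(\|x - D(t')\alpha_0\|_2 > \lambda(2 - Q_t^2)\big) = \Pr\big(\|x - D(t')\alpha_0\|_2^2 > 5(t'^2\cdot\sigma_\alpha^2 + m\cdot\sigma^2)\tau\big) \le \exp(-\tau)$$
$$\Pr\big(\|(I - P_J(t'))x\|_2 > \lambda(2 - Q_t^2)\big) = \Pr\big(\|(I - P_J(t'))x\|_2^2 > 5(t'^2\cdot\sigma_\alpha^2 + m\cdot\sigma^2)\tau\big) \le \exp(-\tau).$$
With a union bound, we conclude that $\|x - D(t')\alpha_0\|_2 \le \lambda(2 - Q_t^2)$ and $\|(I - P_J(t'))x\|_2 \le \lambda(2 - Q_t^2)$, except
with probability at most
$$\Pr\big(\mathcal{E}^c_{\mathrm{coincide}}(t')\big) \le 2\cdot\exp(-\tau) = 2\cdot\exp\Big(-\frac{\lambda^2(2 - Q_t^2)^2}{5\cdot(t'^2\cdot\sigma_\alpha^2 + m\cdot\sigma^2)}\Big).$$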

G Proof of Proposition [4] 

We now consider the proof of Proposition 4, whose structure is identical to that of Proposition 3. We recall
that we are in the noiseless setting, i.e., $\sigma = 0$, and we assume that the coefficients of $\alpha_0$ are almost surely
bounded by $\overline{\alpha}$.

In the light of Lemma [H let us first observe that almost surely 

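$$\|[D(t')]_J^\top(x - D(t')\alpha_0)\|_\infty \le \|x - D(t')\alpha_0\|_2 \le \|[D_0 - D(t')]_J[\alpha_0]_J\|_2 \le t\cdot\|[\alpha_0]_J\|_2 \le \sqrt{k}\,\overline{\alpha}\,t.$$
Similarly, it follows
$$\|[D(t')]_{J^c}^\top(I - P_J(t'))x\|_\infty \le \|(I - P_J(t'))x\|_2 \le \sqrt{k}\,\overline{\alpha}\,t.$$
Now, we can apply Corollary 3, provided that $\sqrt{k}\,\overline{\alpha}\,t < \lambda(2 - Q_t^2)$, as required by Proposition 4. This leads
to the desired conclusion.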

H Proof of Proposition [5] 

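Exploiting Proposition 3 we have
$$\max_{i\in[1;n]}\Pr\big([\mathcal{E}^i_{\mathrm{coincide}}(t)]^c \cup [\mathcal{E}^i_{\mathrm{coincide}}(0)]^c\big) \le 4\exp\Big(-\frac{\lambda^2(2 - Q_t^2)^2}{5\cdot(t^2\sigma_\alpha^2 + m\sigma^2)}\Big) \triangleq \kappa. \tag{75}$$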



The assumption (|24p is equivalent to 

3 4 

t ■ Ga < —Ot 



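2-Q? " 9
hence there exists indeed $\underline{\alpha} > 0$ and $\lambda$ satisfying the assumption (pS)) . Moreover, since $5\log 4 \approx 6.93 \le 9$,
the assumption (25) implies that
$$\frac{\gamma^2}{\log 4} = \frac{\lambda^2(2 - Q_t^2)^2}{5\log 4\cdot(t^2\sigma_\alpha^2 + m\sigma^2)} > 1,$$
hence $\gamma^2 > \log 4$ and $\kappa = 4\cdot e^{-\gamma^2} < 1$. Therefore, we can exploit Lemma 23 and Corollary 4. With
$$A \triangleq \frac{5(1 + \log 2)}{2}\cdot\big(t^2\cdot\sigma_\alpha^2 + 2m\cdot\sigma^2 + 2\lambda k\cdot\sigma_\alpha\big)$$
we have, except with probability at most $\exp(-n\kappa) = \exp(-4ne^{-\gamma^2})$,
$$r_n \le 10A\cdot(3 - \log\kappa)\cdot\kappa = A\cdot 10\cdot(3 - \log 4 + \gamma^2)\cdot 4e^{-\gamma^2}$$
$$\le \big(t^2\cdot\sigma_\alpha^2 + 2m\cdot\sigma^2 + 2\lambda k\sigma_\alpha\big)\cdot 10(1 + \log 2)\cdot 10\cdot\frac{3}{\log 4}\cdot\gamma^2\cdot e^{-\gamma^2}$$
$$\le \big(t^2\cdot\sigma_\alpha^2 + 2m\cdot\sigma^2 + 2\lambda k\sigma_\alpha\big)\cdot 367\cdot\gamma^2\cdot e^{-\gamma^2}.$$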

I Technical lemmas 

The final section of this appendix gathers technical lemmas required by the main results of the paper. 
I.1 Control on the differences of operators

We will now establish several lemmata regarding the difference of operators that appear in the paper.

The following result will exploit the Taylor formula with remainder, based on simple matrix and vector
derivative computations of $D(W,v,t)$; we refer the interested reader to Magnus and Neudecker [1988] for
details about such manipulations. For convenience, let us define
$$C(t) \triangleq \mathrm{Diag}(\cos(v_j t)) \tag{76}$$
$$S(t) \triangleq \mathrm{Diag}(\sin(v_j t)) \tag{77}$$
$$V \triangleq \mathrm{Diag}(v_j) \tag{78}$$
$$R_J(t) \triangleq D_J(t)\Theta_J(t)[\nabla_tD(t)]_J^\top \tag{79}$$
and denote the symmetric part of a square matrix $M$ by $\mathrm{sym}(M) \triangleq \tfrac{1}{2}(M + M^\top)$.
Lemma 17.
$$\nabla_tD(t) = \big(-D_0S(t) + WC(t)\big)V \tag{80}$$
$$\|[\nabla_tD(t)]_J\|_F = \|v_J\|_2 \tag{81}$$
$$\nabla_tP_J(t) = 2\,\mathrm{sym}\big(R_J(t)(I - P_J(t))\big) \tag{82}$$
$$\nabla_t\big[\Theta_J(t)D_J(t)^\top\big] = \Theta_J(t)\big([\nabla_tD(t)]_J^\top(I - P_J(t)) - [D(t)]_J^\top[R_J(t)]^\top\big) \tag{83}$$
$$\nabla_t[\Theta_J(t)] = -2\,\mathrm{sym}\big(\Theta_J(t)[\nabla_tD(t)]_J^\top D_J(t)\Theta_J(t)\big). \tag{84}$$
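
Equations (80) and (81) can be checked against a finite-difference derivative of the curve $D(t) = D_0C(t) + WS(t)$. The snippet below is a sketch under stated assumptions: it uses NumPy, builds a random instance with unit-norm atoms, unit-norm columns $w^j$ orthogonal to $d_0^j$ and a unit-norm $v$, and the helper names `make_curve`, `D_of_t` and `grad_D` are hypothetical.

```python
import numpy as np

def make_curve(m, p, seed=0):
    """Random D0 (unit columns), W (unit columns, w^j orthogonal to d_0^j), unit-norm v."""
    rng = np.random.default_rng(seed)
    D0 = rng.standard_normal((m, p))
    D0 /= np.linalg.norm(D0, axis=0)
    W = rng.standard_normal((m, p))
    W -= D0 * np.sum(D0 * W, axis=0)     # remove the component along d_0^j
    W /= np.linalg.norm(W, axis=0)
    v = rng.standard_normal(p)
    v /= np.linalg.norm(v)
    return D0, W, v

def D_of_t(D0, W, v, t):
    """D(t) = D0 C(t) + W S(t), applied column by column."""
    return D0 * np.cos(v * t) + W * np.sin(v * t)

def grad_D(D0, W, v, t):
    """Closed form (-D0 S(t) + W C(t)) V."""
    return (-D0 * np.sin(v * t) + W * np.cos(v * t)) * v

D0, W, v = make_curve(m=7, p=4)
t, h = 0.2, 1e-6
G = grad_D(D0, W, v, t)
G_fd = (D_of_t(D0, W, v, t + h) - D_of_t(D0, W, v, t - h)) / (2 * h)
fd_error = np.max(np.abs(G - G_fd))

J = [0, 2]                                # an arbitrary support
norm_grad_J = np.linalg.norm(G[:, J], 'fro')
norm_v_J = np.linalg.norm(v[J])
print(fd_error, norm_grad_J, norm_v_J)
```

Because $d_0^j$ and $w^j$ are orthonormal, each column of $D(t)$ stays on the unit sphere and each column of the derivative has norm $|v_j|$, which is what (81) expresses.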



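Lemma 18. Assume $t \le \sqrt{1 - \delta_k(D_0)}$, then for any $W \in \mathcal{W}_{D_0}$, $v \in \mathcal{S}^p$ and $J$ with $|J| \le k$ we have
$$|\!|\!|P_J(t) - P_J(0)|\!|\!|_2 \le \|P_J(t) - P_J(0)\|_F \le 2t\cdot C_t\cdot\|v_J\|_2, \tag{85}$$
$$|\!|\!|\Theta_J(t)[D(t)]_J^\top - \Theta_J(0)[D_0]_J^\top|\!|\!|_2 \le \|\Theta_J(t)[D(t)]_J^\top - \Theta_J(0)[D_0]_J^\top\|_F \le 2t\cdot C_t^2\cdot\|v_J\|_2, \tag{86}$$
$$|\!|\!|\Theta_J(t) - \Theta_J(0)|\!|\!|_2 \le \|\Theta_J(t) - \Theta_J(0)\|_F \le 2t\cdot C_t^3\cdot\|v_J\|_2. \tag{87}$$
Lemma 19. Assume that $t \le \sqrt{1 - \delta_k(D_0)}$. Denote $P_{J,1} = P_J(W_1,v_1,t)$ and $P_{J,2} = P_J(W_2,v_2,t)$, and
similarly for the other considered quantities. For any $W_1, W_2 \in \mathcal{W}_{D_0}$; $v_1, v_2 \in \mathcal{S}^p$, and $J$ with $|J| \le k$ we
have
$$|\!|\!|D_{J,1} - D_{J,2}|\!|\!|_2 \le \|D_{J,1} - D_{J,2}\|_F \le \sqrt{2}\,t\cdot d((W_1,v_1),(W_2,v_2)) \tag{88}$$
$$|\!|\!|P_{J,1} - P_{J,2}|\!|\!|_2 \le \|P_{J,1} - P_{J,2}\|_F \le 5t\cdot C_t\cdot d((W_1,v_1),(W_2,v_2)) \tag{89}$$
$$|\!|\!|\Theta_{J,1}[D_1]_J^\top - \Theta_{J,2}[D_2]_J^\top|\!|\!|_2 \le \|\Theta_{J,1}[D_1]_J^\top - \Theta_{J,2}[D_2]_J^\top\|_F \le 5t\cdot C_t^2\cdot d((W_1,v_1),(W_2,v_2)) \tag{90}$$
$$|\!|\!|\Theta_{J,1} - \Theta_{J,2}|\!|\!|_2 \le \|\Theta_{J,1} - \Theta_{J,2}\|_F \le 5t\cdot C_t^3\cdot d((W_1,v_1),(W_2,v_2)). \tag{91}$$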

Proof of Lemma 18 - Equation (85). We apply a Taylor formula with remainder [e.g., Theorem 14.4 in Dym,
2007] based on Lemma 17 (Equation (82)): for any $U \in \mathbb{R}^{m\times m}$ there exists $0 < t' = t'(U) < t$ such that
$$\mathrm{Tr}\big(U\cdot(P_J(t) - P_J(0))\big) = 2t\cdot\mathrm{Tr}\big(U\cdot\mathrm{sym}(R_J(t')(I - P_J(t')))\big) \le 2t\cdot\|R_J(t')(I - P_J(t'))\|_F\cdot\|U\|_F.$$
Given that $\|[\nabla D(t')]_J\|_F = \|v_J\|_2$, we have using the bound (43)
$$\|R_J(t')\|_F \le |\!|\!|D_J(t')\Theta_J(t')|\!|\!|_2\cdot\|[\nabla D(t')]_J\|_F \le C_t\cdot\|v_J\|_2, \tag{92}$$
hence the upper bound
$$\mathrm{Tr}\big(U\cdot(P_J(t) - P_J(0))\big) \le 2t\cdot\|R_J(t')\|_F\cdot\|U\|_F \le 2t\cdot C_t\cdot\|v_J\|_2\cdot\|U\|_F.$$
We conclude using the fact that $|\!|\!|P_J(t) - P_J(0)|\!|\!|_2 \le \|P_J(t) - P_J(0)\|_F = \max_{\|U\|_F\le 1}\mathrm{Tr}\big(U^\top(P_J(t) - P_J(0))\big)$.

□

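Proof of Lemma 18 - Equation (86). Again, we apply a Taylor formula with remainder and Lemma 17 (Equation (83)):
for any $U \in \mathbb{R}^{m\times|J|}$, there exists some $0 < t' < t$ such that
$$\mathrm{Tr}\big(U(\Theta_J(t)D_J(t)^\top - \Theta_J(0)[D_0]_J^\top)\big) = t\cdot\mathrm{Tr}\big(U\big[\Theta_J(t')\big([\nabla_tD(t')]_J^\top(I - P_J(t')) - [D(t')]_J^\top[R_J(t')]^\top\big)\big]\big)$$
$$\le t\cdot\big\|\Theta_J(t')\big([\nabla_tD(t')]_J^\top(I - P_J(t')) - [D(t')]_J^\top[R_J(t')]^\top\big)\big\|_F\cdot\|U\|_F.$$
Now, using the bounds (42), (43) and (92) we have
$$\|\Theta_J(t')[\nabla_tD(t')]_J^\top(I - P_J(t'))\|_F \le |\!|\!|\Theta_J(t')|\!|\!|_2\cdot\|[\nabla_tD(t')]_J\|_F \le C_t^2\cdot\|v_J\|_2,$$
$$\|\Theta_J(t')[D(t')]_J^\top[R_J(t')]^\top\|_F \le |\!|\!|\Theta_J(t')[D(t')]_J^\top|\!|\!|_2\cdot\|R_J(t')\|_F \le C_t\cdot(C_t\cdot\|v_J\|_2) \le C_t^2\cdot\|v_J\|_2$$
and we can conclude. □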

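Proof of Lemma 18 - Equation (87). We follow the same line, using the intermediate result from Lemma 17
(Equation (84)). For any $U \in \mathbb{R}^{|J|\times|J|}$ there is some $0 < t' = t'(U) < t$ such that
$$\big|\mathrm{Tr}\big(U\cdot(\Theta_J(t) - \Theta_J(0))\big)\big| = \big|2t\cdot\mathrm{Tr}\big(U\cdot\mathrm{sym}(\Theta_J(t')[\nabla_tD(t')]_J^\top D_J(t')\Theta_J(t'))\big)\big|$$
$$\le 2t\cdot\|\Theta_J(t')[\nabla_tD(t')]_J^\top D_J(t')\Theta_J(t')\|_F\cdot\|U\|_F.$$
Since $\|[\nabla D(t')]_J\|_F = \|v_J\|_2$, using (42) and (43) we obtain the upper bound
$$2t\cdot\|\Theta_J(t')[\nabla_tD(t')]_J^\top D_J(t')\Theta_J(t')\|_F \le 2t\cdot|\!|\!|D_J(t')\Theta_J(t')|\!|\!|_2\cdot|\!|\!|\Theta_J(t')|\!|\!|_2\cdot\|[\nabla_tD(t')]_J\|_F$$
$$\le 2t\cdot C_t\cdot C_t^2\cdot\|v_J\|_2.$$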



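□

Proof of Lemma 19. Since $d((W_1,v_1),(W_2,v_2)) = \max\big[\max_j\|w_1^j - w_2^j\|_2, \|v_1 - v_2\|_2\big] \le \epsilon$, we can bound
the difference between the columns of $D_i \triangleq D(D_0, W_i, v_i, t)$, $i = 1, 2$:
$$d_1^j - d_2^j = \big(\cos(v_1^j t) - \cos(v_2^j t)\big)\cdot d_0^j + \sin(v_1^j t)\cdot w_1^j - \sin(v_2^j t)\cdot w_2^j$$
$$\|d_1^j - d_2^j\|_2^2 = \big(\cos(v_1^j t) - \cos(v_2^j t)\big)^2 + \|\sin(v_1^j t)\cdot w_1^j - \sin(v_2^j t)\cdot w_2^j\|_2^2$$
$$= \cos^2(v_1^j t) + \cos^2(v_2^j t) - 2\cos(v_1^j t)\cos(v_2^j t) + \sin^2(v_1^j t) + \sin^2(v_2^j t) - 2\sin(v_1^j t)\sin(v_2^j t)[w_1^j]^\top w_2^j$$
$$= 2 - 2\cos(v_1^j t)\cos(v_2^j t) - \big[2 - \|w_1^j - w_2^j\|_2^2\big]\sin(v_1^j t)\sin(v_2^j t)$$
$$= \|w_1^j - w_2^j\|_2^2\cdot\sin(v_1^j t)\sin(v_2^j t) + 4\sin^2\Big(\frac{(v_1^j - v_2^j)t}{2}\Big) \le \big(\epsilon^2v_1^jv_2^j + (v_1^j - v_2^j)^2\big)\,t^2.$$
As a result we obtain
$$\|D_1 - D_2\|_F^2 = \sum_{j=1}^p\|d_1^j - d_2^j\|_2^2 \le \big(\epsilon^2v_1^\top v_2 + \|v_1 - v_2\|_2^2\big)\,t^2 \le 2\epsilon^2t^2.$$
Exploiting Lemma 1 we can write $D_2 = D(D_1, W, v, t')$ with $t' \le \frac{\pi}{2}\|D_1 - D_2\|_F \le \frac{\pi}{\sqrt{2}}\epsilon t$.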

Now consider D(t) = D(Di,W,v, r) with < t < t' and (t) its columns. Noticing that r i~> d^ {t) 
is a geodesic on the unit sphere that joins d^(0) = dj to d^(t') = d2, wc obtain 

||d^(r) - di^||2 < maxdldj - dih, \\di d^h) = 2sin (^^ 

Hence, exploiting Lemma [1] again, we can also write D(t) — D(Do, W, v', r'), with r' < This implies 
that for every dictionary on the curve $\tau \mapsto D(\tau)$, $0 \le \tau \le t'$, the bounds of Lemma 5 with the constant
$C_t$ hold true. We can therefore repeat the Taylor argument of the proof of Lemma 18, noticing that since
the considered end point is at $t' \le \frac{\pi}{\sqrt{2}}\epsilon t$ instead of $t$, the factor $2t$ in the resulting bounds is replaced by
$2t' \le \pi\sqrt{2}\,\epsilon t \le 5\epsilon t$. □
I.2 Control of norms

In this section, we first recall some known concentration results.

Lemma 20 (From Hsu et al. [2011]). Let us consider $z \in \mathbb{R}^m$ a random vector of independent sub-Gaussian
variables with parameters upper bounded by $\sigma > 0$. Let $A$ be a fixed matrix with $m$ columns. For all $\tau > 0$, it holds
$$\Pr\Big[\|Az\|_2^2 > \sigma^2\Big(\|A\|_F^2 + 2\sqrt{\mathrm{Tr}[(A^\top A)^2]\,\tau} + 2\,|\!|\!|A^\top A|\!|\!|_2\,\tau\Big)\Big] \le \exp(-\tau).$$
In particular, for any $\tau \ge 1$, we have
$$\Pr\big(\|Az\|_2^2 > 5\sigma^2\|A\|_F^2\,\tau\big) \le \exp(-\tau).$$
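
The simplified threshold $5\sigma^2\|A\|_F^2\tau$ follows from the general one because $\sqrt{\mathrm{Tr}[(A^\top A)^2]} \le \|A\|_F^2$, $|\!|\!|A^\top A|\!|\!|_2 \le \|A\|_F^2$ and $\sqrt{\tau} \le \tau$ for $\tau \ge 1$. The sketch below, assuming NumPy and taking $\sigma = 1$, checks this deterministic domination on a random matrix; the function names and the test matrix are illustrative.

```python
import numpy as np

def general_threshold(A, tau):
    """||A||_F^2 + 2 sqrt(Tr[(A^T A)^2] tau) + 2 |||A^T A|||_2 tau  (sigma = 1)."""
    AtA = A.T @ A
    return (np.linalg.norm(A, 'fro') ** 2
            + 2.0 * np.sqrt(np.trace(AtA @ AtA) * tau)
            + 2.0 * np.linalg.norm(AtA, 2) * tau)

def simplified_threshold(A, tau):
    """5 ||A||_F^2 tau  (sigma = 1)."""
    return 5.0 * np.linalg.norm(A, 'fro') ** 2 * tau

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))          # hypothetical fixed matrix
taus = [1.0, 2.5, 10.0]
dominated = [general_threshold(A, tau) <= simplified_threshold(A, tau) for tau in taus]
print(dominated)
```

Since the simplified threshold is larger, the tail event it defines is contained in the general one, so the same $\exp(-\tau)$ bound applies.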

Lemma 21 (Bernstein's Inequality). Let $\{z_j\}_{j\in[1;n]}$ be a collection of independent, zero-mean random
variables. If there exist $M, \sigma \in \mathbb{R}_+$ such that for any integer $k \ge 2$ and any $j \in [1;n]$ it holds
$$\mathbb{E}\big[|z_j|^k\big] \le \frac{k!}{2}M^{k-2}\sigma^2,$$
then we have for any $t \ge 0$,
$$\Pr\Big(\Big|\frac{1}{n}\sum_{j=1}^n z_j\Big| > t\Big) \le \exp\Big(-\frac{nt^2}{2\sigma^2 + 2Mt}\Big).$$
In particular, for any $t \le \sigma^2/M$ we have
$$\Pr\Big(\Big|\frac{1}{n}\sum_{j=1}^n z_j\Big| > t\Big) \le \exp\Big(-\frac{nt^2}{4\sigma^2}\Big).$$
Proof. The displayed result is a straightforward adaptation of Lemma 4.1.9 in De la Peña and Giné [1999],
where we use the term $\sigma^2$ in lieu of the true variance. □

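Lemma 22 (Control of the $\ell_2$-norm of a signal and its coefficients). Let $x$ be a signal following our generative
model, and $\alpha_0$ be its coefficients. For any $\tau \ge 1$ and $D = D(W,v,t)$, we have
$$\Pr\big(\|x - D\alpha_0\|_2^2 + \|\varepsilon\|_2^2 > 5(t^2\sigma_\alpha^2 + 2m\sigma^2)\tau\big) \le \exp(-\tau) \tag{93}$$
$$\Pr\big(\|x - D\alpha_0\|_2^2 > 5(t^2\sigma_\alpha^2 + m\sigma^2)\tau\big) \le \exp(-\tau) \tag{94}$$
$$\Pr\big(\|(I - P_J(t))x\|_2^2 > 5(t^2\sigma_\alpha^2 + (m - k)\sigma^2)\tau\big) \le \exp(-\tau) \tag{95}$$
$$\Pr\big(\|\alpha_0\|_2^2 > 5k\sigma_\alpha^2\tau\big) \le \exp(-\tau) \tag{96}$$
$$\Pr\big(\|x\|_2^2 > 5(k\sigma_\alpha^2 + m\sigma^2)\tau\big) \le \exp(-\tau). \tag{97}$$

Proof. We prove the result for $\|x - D\alpha_0\|_2^2 + \|\varepsilon\|_2^2$. The same technique applies to the other quantities. We
recall that $x - D\alpha_0 = [D_0 - D]_J[\alpha_0]_J + \varepsilon$, and that the considered norm can be expressed as follows
$$\|x - D\alpha_0\|_2^2 + \|\varepsilon\|_2^2 = \left\|\begin{pmatrix}\sigma_\alpha[D_0 - D]_J & \sigma I\\ 0 & \sigma I\end{pmatrix}\begin{pmatrix}[\alpha_0]_J/\sigma_\alpha\\ \varepsilon/\sigma\end{pmatrix}\right\|_2^2.$$
The result is a direct application of Lemma 20 conditioned to the draw of $J$, using Lemma 5 to control
$$\left\|\begin{pmatrix}\sigma_\alpha[D_0 - D]_J & \sigma I\\ 0 & \sigma I\end{pmatrix}\right\|_F^2 = \|[D_0 - D]_J\|_F^2\,\sigma_\alpha^2 + 2m\sigma^2 \le t^2\sigma_\alpha^2 + 2m\sigma^2.$$
The bound being independent of $J$, the result is also true without conditioning. Note that to control the
behaviour of $\|(I - P_J(t))x\|_2$ we use the fact that since $P_J(t)$ is an orthogonal projector on a subspace of
dimension $k$, we have $\|I - P_J(t)\|_F^2 = m - k$. □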



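Lemma 23. Let $x$ and $\alpha_0$ be drawn according to our signal model. Define
$$y \triangleq \sup_{W,v}\mathcal{L}_x(D(W,v,t), \alpha_0)$$
$$y' \triangleq \sup_{W,v}\big\{\mathcal{L}_x(D(W,v,t), \alpha_0) + \mathcal{L}_x(D_0, \alpha_0)\big\}.$$
For any $\tau \ge 1$ we have
$$\Pr\big(y > A_{\mathcal{L}}(t)\cdot\tau\big) \le e^{-\tau} \tag{98}$$
$$\Pr\big(y' > A'_{\mathcal{L}}(t)\cdot\tau\big) \le e^{-\tau} \tag{99}$$
where
$$A_{\mathcal{L}}(t) \triangleq \frac{5(1 + \log 2)}{2}\cdot\big(t^2\sigma_\alpha^2 + m\sigma^2 + \lambda k\sigma_\alpha\big)$$
$$A'_{\mathcal{L}}(t) \triangleq \frac{5(1 + \log 2)}{2}\cdot\big(t^2\sigma_\alpha^2 + 2m\sigma^2 + 2\lambda k\sigma_\alpha\big).$$
Proof. Using Lemma lH we have, for $D = D(W,v,t)$, uniformly over $W, v$:
$$\mathcal{L}_x(D, \alpha_0) = \tfrac{1}{2}\|x - D\alpha_0\|_2^2 + \lambda\|\alpha_0\|_1 \le \tfrac{1}{2}\|x - D\alpha_0\|_2^2 + \lambda\sqrt{k}\,\|\alpha_0\|_2$$
$$\mathcal{L}_x(D, \alpha_0) + \mathcal{L}_x(D_0, \alpha_0) \le \tfrac{1}{2}\|x - D\alpha_0\|_2^2 + \tfrac{1}{2}\|\varepsilon\|_2^2 + 2\lambda\sqrt{k}\,\|\alpha_0\|_2.$$
Fix any $\tau \ge 1$ and define $\tau' = (1 + \log 2)\tau \ge \tau + \log 2$. If $\|x - D\alpha_0\|_2^2 \le 5(t^2\sigma_\alpha^2 + m\sigma^2)\tau'$ and $\|\alpha_0\|_2^2 \le 5k\sigma_\alpha^2\tau'$,
then $\|\alpha_0\|_2 \le \sqrt{5k}\,\sigma_\alpha\sqrt{\tau'} \le \sqrt{5k}\,\sigma_\alpha\tau'$, hence we have
$$y \le \tfrac{1}{2}\big(5t^2\sigma_\alpha^2 + 5m\sigma^2\big)\tau' + \lambda\sqrt{k}\sqrt{5k\sigma_\alpha^2}\,\tau' \le \tfrac{5}{2}\big(t^2\sigma_\alpha^2 + m\sigma^2 + \lambda k\sigma_\alpha\big)\tau' = A_{\mathcal{L}}(t)\cdot\tau.$$
Lemma 22 and a union bound yield $\Pr(y > A_{\mathcal{L}}(t)\cdot\tau) \le 2\exp(-\tau') \le \exp(-\tau)$. The proof for $y'$ is similar. □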

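Lemma 24. Let $y$ be a random variable satisfying for any $\tau \ge 1$
$$\Pr(|y| > A\tau) \le \exp(-\tau), \tag{100}$$
for some positive constant $A > 0$. Consider an event $\mathcal{E}$ defined on the same probability space as that of $y$.
For any $u \ge 1$, any integer $q \ge 1$, and $0 < p \le 1$, we have
$$\mathbb{E}\big[1_{\mathcal{E}}|y|^{pq}\big] \le q!\,\big[A^pu\big]^q\,\big[\Pr(\mathcal{E}) + \exp(3 - u)\big] \tag{101}$$
$$\mathbb{E}\big[\big|1_{\mathcal{E}}|y|^p - \mathbb{E}[1_{\mathcal{E}}|y|^p]\big|^q\big] \le q!\,\big[2A^pu\big]^q\,\big[\Pr(\mathcal{E}) + \exp(3 - u)\big]. \tag{102}$$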


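Proof. To begin with, let us notice that by invoking twice the triangle inequality, we have
$$\Big(\mathbb{E}\big\{\big|1_{\mathcal{E}}|y|^p - \mathbb{E}\{1_{\mathcal{E}}|y|^p\}\big|^q\big\}\Big)^{1/q} \le \Big(\mathbb{E}\big\{1_{\mathcal{E}}|y|^{pq}\big\}\Big)^{1/q} + \Big(\mathbb{E}\big\{\big(\mathbb{E}\{1_{\mathcal{E}}|y|^p\}\big)^q\big\}\Big)^{1/q}$$
so that by using Jensen's inequality, we obtain
$$\mathbb{E}\big[\big|1_{\mathcal{E}}|y|^p - \mathbb{E}[1_{\mathcal{E}}|y|^p]\big|^q\big] \le 2^q\,\mathbb{E}\big[1_{\mathcal{E}}|y|^{pq}\big],$$
thus proving (102) provided that (101) holds. We now focus on these raw moments. Let us fix some $u \ge 1$. We
introduce the event
$$\mathcal{K} \triangleq \Big\{\omega;\ \frac{|y(\omega)|}{A} \le u\Big\},$$
and define $l_0$ as the largest integer such that $u \in [l_0, l_0 + 1)$. We can then "discretize" the event $\mathcal{K}^c$ as
$$\mathcal{K}^c \subseteq \bigcup_{l=l_0}^{\infty}\mathcal{K}_l, \quad\text{with } \mathcal{K}_l \triangleq \Big\{\omega;\ \frac{|y(\omega)|}{A} \in [l, l+1)\Big\}.$$
We have
$$\mathbb{E}\{1_{\mathcal{E}}|y|^{pq}\} = \mathbb{E}\{1_{\mathcal{E}\cap\mathcal{K}}|y|^{pq}\} + \mathbb{E}\{1_{\mathcal{E}\cap\mathcal{K}^c}|y|^{pq}\} \le (Au)^{pq}\cdot\Pr(\mathcal{E}) + \sum_{l=l_0}^{\infty}\mathbb{E}\{1_{\mathcal{K}_l}|y|^{pq}\}$$
$$\le A^{pq}\cdot u^{pq}\cdot\Pr(\mathcal{E}) + \sum_{l=l_0}^{\infty}A^{pq}(l+1)^{pq}\cdot\mathbb{E}\{1_{\mathcal{K}_l}\}$$
$$\le A^{pq}\cdot u^{q}\cdot\Pr(\mathcal{E}) + A^{pq}\sum_{l=l_0}^{\infty}(l+1)^{pq}\cdot\Pr\big[|y| \ge Al\big]$$
where in the last line we used $u^{pq} \le u^q$ since $u \ge 1$ and $p \le 1$. Using the hypothesis (100), we continue
$$\mathbb{E}\{1_{\mathcal{E}}|y|^{pq}\} \le A^{pq}\cdot\Big[u^q\cdot\Pr(\mathcal{E}) + \sum_{l=l_0}^{\infty}(l+1)^{pq}\exp(-l)\Big].$$
Upper bounding the discrete sum by a continuous integral, we recognize here the incomplete Gamma function
[Gautschi, 1998]:
$$\sum_{l=l_0}^{\infty}(l+1)^{pq}e^{-l} = \sum_{l=l_0}^{\infty}\int_l^{l+1}(l+1)^{pq}e^{-l}\,dt \le \sum_{l=l_0}^{\infty}\int_l^{l+1}(t+1)^{pq}e^{-(t+1)+t+1-l}\,dt$$
$$\le e^2\sum_{l=l_0}^{\infty}\int_l^{l+1}(t+1)^{pq}e^{-(t+1)}\,dt = e^2\int_{l_0}^{\infty}(t+1)^{pq}e^{-(t+1)}\,dt$$
$$= e^2\int_{l_0+1}^{\infty}t^{pq}e^{-t}\,dt \le e^2\int_u^{\infty}t^{q}e^{-t}\,dt = e^2\,\Gamma(q+1, u)$$
where again we used $t^{pq} \le t^q$ for $t \ge 1$. A standard formula [see equation (1.3) in Gautschi, 1998] leads to,
for $u \ge 1$,
$$\Gamma(q+1, u) = q!\exp(-u)\sum_{j=0}^{q}\frac{u^j}{j!} \le e\,q!\exp(-u)\,u^q.$$
Putting all the pieces together we thus reach the advertised conclusion.

□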



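Corollary 4. Consider $n$ independent draws $\{y^i\}_{i\in[1;n]}$ satisfying the hypothesis (100). Consider also $n$
independent events $\{\mathcal{E}^i\}_{i\in[1;n]}$ defined on the same probability space, with $\max_{i\in[1;n]}\Pr(\mathcal{E}^i) \le \kappa \le 1$. Then,
for any $0 < p \le 1$ and $0 \le \tau \le \sqrt{n\kappa}$, we have
$$\mathbb{E}\{1_{\mathcal{E}^i}|y^i|^p\} \le 2A^p\cdot(3 - \log\kappa)\cdot\kappa \tag{103}$$
$$\Pr\Big(\Big|\frac{1}{n}\sum_{i=1}^n\big(1_{\mathcal{E}^i}|y^i|^p - \mathbb{E}\{1_{\mathcal{E}^i}|y^i|^p\}\big)\Big| \ge 8A^p\cdot(3 - \log\kappa)\cdot\sqrt{\kappa}\cdot\frac{\tau}{\sqrt{n}}\Big) \le \exp(-\tau^2). \tag{104}$$
Proof. Applying Lemma 24-Equation (101) with $u = 3 - \log\kappa$ for $q = 1$ we obtain (103), where we used
that $\Pr(\mathcal{E}^i) + e^{3-u} \le \kappa + e^{3-u} = 2\kappa$. Similarly, applying Lemma 24-Equation (102) for $q \ge 2$, we can
apply Lemma 21 with $z_i = 1_{\mathcal{E}^i}|y^i|^p - \mathbb{E}\{1_{\mathcal{E}^i}|y^i|^p\}$, $M = 2A^pu$ and $\sigma = \sqrt{2}M\sqrt{\kappa + e^{3-u}} = 2M\sqrt{\kappa} =
4A^p(3 - \log\kappa)\sqrt{\kappa}$. This shows that for $0 \le \tau \le \sqrt{n\kappa}$ we have (104).

□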